Practical guide

Implement AIWebIndex in a weekend.

This page is operational, not normative. For binding requirements, read the specification; this guide covers how to actually build something conformant.


Overview

A minimum-viable AIWebIndex implementation needs five things:

  1. An HTTP client that sends the AIWebIndex User-Agent.
  2. A robots.txt parser that respects rules for AIWebIndex.
  3. An HTML extractor that produces an AIDocument shape.
  4. A verification flow (DNS TXT or .well-known) for site ownership.
  5. An origin rate-limit (default 2 seconds between hits to the same domain).

Everything else (storage, queuing, dashboards, billing) is your product surface. The protocol does not require any of it.

1. Send the User-Agent

Every protocol-driven fetch carries a stable User-Agent so site operators can allowlist (or block) you predictably. Use AIWebIndex/1.0 as the prefix and append your implementation’s identifier.

curl

curl -H 'User-Agent: AIWebIndex/1.0 (+https://example.com/bot; my-impl)' \
     'https://example.com/article'

Python (httpx)

import httpx

UA = "AIWebIndex/1.0 (+https://example.com/bot; my-impl)"

with httpx.Client(headers={"User-Agent": UA}, timeout=15.0) as client:
    r = client.get("https://example.com/article", follow_redirects=True)
    r.raise_for_status()
    html = r.text
    final_url = str(r.url)

Node (fetch)

const UA = "AIWebIndex/1.0 (+https://example.com/bot; my-impl)";

const res = await fetch("https://example.com/article", {
  redirect: "follow",
  headers: { "User-Agent": UA },
});
if (!res.ok) throw new Error(`upstream ${res.status}`);
const html = await res.text();
const finalUrl = res.url;

2. Honor robots.txt

Before fetching, check https://<domain>/robots.txt for rules addressed to your User-Agent token (User-agent: AIWebIndex) or the wildcard (User-agent: *). Most languages have a parser available; the one in the Python standard library (urllib.robotparser) is sufficient.

A site that wants to block you publishes:

User-agent: AIWebIndex
Disallow: /

A site that wants to explicitly allow you publishes:

User-agent: AIWebIndex
Allow: /

User-agent: *
Disallow:
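
A minimal pre-fetch check with the standard-library parser might look like this (a sketch; in practice, cache the parsed robots.txt per host instead of re-fetching it for every URL):

# python (urllib.robotparser)
from urllib import robotparser
from urllib.parse import urlparse

def allowed(url: str) -> bool:
    origin = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{origin.scheme}://{origin.netloc}/robots.txt")
    # read() fetches robots.txt with urllib's default User-Agent; fetch the file
    # yourself and call rp.parse(lines) if you want your own UA on that request too.
    rp.read()
    # The token matches both "User-agent: AIWebIndex" and "User-agent: *" rules.
    return rp.can_fetch("AIWebIndex", url)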

3. Extract an AIDocument

Take the fetched HTML and produce the JSON envelope described in spec section 4. The minimum required fields are: url, title, markdown, headings, links, meta, and crawl.

Field-by-field:

  • url: the URL after redirects (r.url in httpx; res.url in fetch).
  • title: prefer og:title, then the first h1, then <title>.
  • markdown: strip boilerplate (nav, footer, ads) and convert the main content. Tools like trafilatura (Python) or @mozilla/readability (Node) handle this well.
  • headings: walk the DOM, collect h1-h6 in document order with their levels.
  • links: collect <a href> elements; mark internal: true when the link host equals the page host.
  • meta.language: prefer html[lang]; otherwise detect.
  • crawl.fetched_at: an RFC 3339 timestamp (e.g., 2026-05-10T12:34:56Z).

See spec section 4.3 for a complete example.
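
A rough assembly pass might look like the sketch below. It parses with BeautifulSoup, takes the markdown string from whichever extraction tool you picked above, and uses illustrative sub-field names for headings and links; spec section 4.1 defines the exact shapes.

# python (beautifulsoup4)
from datetime import datetime, timezone
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def extract_aidocument(html: str, final_url: str, markdown: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    og = soup.find("meta", property="og:title")
    h1 = soup.find("h1")
    # Title preference: og:title, then the first h1, then <title>.
    title = (og.get("content") if og else None) \
        or (h1.get_text(strip=True) if h1 else None) \
        or (soup.title.get_text(strip=True) if soup.title else "")
    page_host = urlparse(final_url).hostname
    links = []
    for a in soup.find_all("a", href=True):
        href = urljoin(final_url, a["href"])
        links.append({
            "href": href,
            "text": a.get_text(strip=True),
            "internal": urlparse(href).hostname == page_host,
        })
    return {
        "url": final_url,
        "title": title,
        "markdown": markdown,  # from trafilatura / readability plus your converter
        "headings": [
            {"level": int(h.name[1]), "text": h.get_text(strip=True)}
            for h in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
        ],
        "links": links,
        "meta": {"language": soup.html.get("lang") if soup.html else None},
        "crawl": {"fetched_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")},
    }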

4. Verify domain ownership

When a site owner registers their domain with your implementation, generate a verification token (at least 128 bits of entropy) and ask them to publish either:

DNS TXT (recommended)

_aiwebindex-verify.<their-domain>  TXT  "aiwi-verify=<token>"

Then resolve from a public resolver:

# python (dnspython)
import dns.resolver

verified = False
answers = dns.resolver.resolve(
    f"_aiwebindex-verify.{domain}", "TXT"
)
for rdata in answers:
    # TXT record data stringifies with surrounding quotes; a substring match is enough.
    if f"aiwi-verify={token}" in str(rdata):
        verified = True
        break

.well-known file

# you request:
GET https://<their-domain>/.well-known/aiwebindex-verify.txt

# their server responds with the body:
aiwi-verify=<token>

Use HTTPS only. Reject plain-HTTP fetches even if they redirect to HTTPS later. Per spec 5.2, cross-domain redirects do not count.
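
A verification fetch that enforces both rules might look like this sketch (httpx; the User-Agent string is the one from section 1, and the body check mirrors the file format above):

# python (httpx)
import httpx

UA = "AIWebIndex/1.0 (+https://example.com/bot; my-impl)"

def well_known_verified(domain: str, token: str) -> bool:
    url = f"https://{domain}/.well-known/aiwebindex-verify.txt"  # HTTPS only
    with httpx.Client(headers={"User-Agent": UA}, timeout=15.0) as client:
        r = client.get(url, follow_redirects=True)
    if r.status_code != 200:
        return False
    # Every hop, including the final response, must stay on HTTPS and on the
    # registered domain; cross-domain redirects do not count (spec 5.2).
    for hop in r.history + [r]:
        if hop.url.scheme != "https" or hop.url.host != domain:
            return False
    return f"aiwi-verify={token}" in r.text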

5. Rate-limit by origin

Track the last fetch time per canonical hostname and do not hit the same hostname again until 2 seconds have passed since the last attempt; the helper below simply waits out the remainder. A simple in-memory map keyed by hostname is enough for most implementations:

const lastFetch = new Map<string, number>();

async function fetchWithCooldown(url: URL) {
  const host = url.hostname;
  const last = lastFetch.get(host) ?? 0;
  const wait = 2000 - (Date.now() - last);
  if (wait > 0) await new Promise(r => setTimeout(r, wait));
  lastFetch.set(host, Date.now());
  return fetch(url, { headers: { "User-Agent": UA } });
}

Honor Crawl-delay in robots.txt where the value exceeds your default. Honor HTTP 429 Too Many Requests and 503 Service Unavailable as described in spec 6.3.
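
One simple way to honor those responses is to put the whole origin on cooldown for as long as the server asks (a sketch; it handles only the delta-seconds form of Retry-After, and the 60-second fallback is an arbitrary assumption, not a spec value):

# python
def backoff_seconds(status_code: int, headers: dict) -> float:
    """Seconds to pause an origin after a response; 0 means keep going."""
    if status_code not in (429, 503):
        return 0.0
    retry_after = headers.get("Retry-After", "")
    if retry_after.isdigit():  # delta-seconds form only; the HTTP-date form is ignored here
        return float(retry_after)
    return 60.0  # fallback pause when no usable Retry-After is present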

Conformance checklist

Before shipping, walk through this list. If you can answer “yes” to each item, your implementation conforms to AIWebIndex 1.0.

  • Every protocol-driven fetch sends a User-Agent starting with AIWebIndex/1.0.
  • robots.txt rules for User-agent: AIWebIndex are honored.
  • Wildcard User-agent: * rules apply when no agent-specific block is present.
  • The robots.txt Crawl-delay directive is respected when greater than the default per-origin cooldown.
  • HTTP 429 and 503 responses pause fetches per spec section 6.3.
  • AIDocument responses include all required fields per spec section 4.1.
  • At least one of DNS TXT or .well-known verification is implemented end-to-end.
  • Verification fetches use HTTPS only and reject cross-domain redirects.
  • No User-Agent spoofing or bypass of authenticated content.
  • No end-user identifiers leak into the AIDocument or the outbound request.

Reference implementation

The Lyrenth stack at www.lyrenth.com is the reference: a Go API, a Postgres-backed crawl queue, a chromedp-based renderer for SPA pages, and a Next.js dashboard for site owners. Source for the protocol-relevant bits is on GitHub. Use it to compare behavior, but implement against the spec, not against Lyrenth specifically.

Get listed once shipped

When your implementation is live, email hello@aiwebindex.org with a link and a one-paragraph summary. Conforming implementations are listed on /implementations at no charge.


Read the full spec →
See live implementations