Implement AIWebIndex in a weekend.
This page is operational, not normative. For binding requirements read the specification; this guide covers how to actually build something conformant.
Overview
A minimum-viable AIWebIndex implementation needs five things:
- An HTTP client that sends the AIWebIndex User-Agent.
- A robots.txt parser that respects rules for AIWebIndex.
- An HTML extractor that produces an AIDocument shape.
- A verification flow (DNS TXT or .well-known) for site ownership.
- A per-origin rate limit (default: 2 seconds between hits to the same domain).
Everything else (storage, queuing, dashboards, billing) is your product surface. The protocol does not require any of it.
1. Send the User-Agent
Every protocol-driven fetch carries a stable User-Agent so site operators can allowlist (or block) you predictably. Use AIWebIndex/2.0 as the prefix and append your implementation’s identifier.
curl
curl -H 'User-Agent: AIWebIndex/2.0 (+https://example.com/bot; my-impl)' \
  'https://example.com/article'
Python (httpx)
import httpx
UA = "AIWebIndex/2.0 (+https://example.com/bot; my-impl)"
with httpx.Client(headers={"User-Agent": UA}, timeout=15.0) as client:
    r = client.get("https://example.com/article", follow_redirects=True)
    r.raise_for_status()
    html = r.text
    final_url = str(r.url)
Node (fetch)
const UA = "AIWebIndex/2.0 (+https://example.com/bot; my-impl)";
const res = await fetch("https://example.com/article", {
  redirect: "follow",
  headers: { "User-Agent": UA },
});
if (!res.ok) throw new Error(`upstream ${res.status}`);
const html = await res.text();
const finalUrl = res.url;
2. Honor robots.txt
Before fetching, check https://<domain>/robots.txt for rules addressed to your User-Agent token (User-agent: AIWebIndex) or the wildcard (User-agent: *). Most languages have a parser available; the one in the Python standard library (urllib.robotparser) is sufficient.
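As a minimal sketch of that check in Python, assuming the robots.txt body has already been downloaded into a string, urllib.robotparser can evaluate both the agent-specific group and the wildcard group. The helper name is_allowed and the two sample rule sets are illustrative only and are not part of the protocol.

```python
from urllib.robotparser import RobotFileParser

AGENT = "AIWebIndex"  # the User-agent token site operators address


def is_allowed(robots_txt: str, url: str, agent: str = AGENT) -> bool:
    """Return True if robots_txt permits `agent` to fetch `url`.

    Hypothetical helper: it parses an already-downloaded robots.txt body,
    so fetching the file (and handling a missing file) is up to the caller.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Illustrative rule sets, not taken from any real site.
blocked = "User-agent: AIWebIndex\nDisallow: /\n"
wildcard_only = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(blocked, "https://example.com/article"))
print(is_allowed(wildcard_only, "https://example.com/article"))
print(is_allowed(wildcard_only, "https://example.com/private/x"))
```

The parser falls back to the wildcard group only when no group names the agent, which matches the precedence described above.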
A site that wants to block you publishes:
User-agent: AIWebIndex
Disallow: /
A site that wants to explicitly allow you publishes:
User-agent: AIWebIndex
Allow: /
User-agent: *
Disallow:
3. Extract an AIDocument (2.0 grouped envelope)
Take the fetched HTML and produce the JSON envelope described in spec section 4. Version 2.0 groups fields under semantic blocks rather than the flat layout used by 1.0. The required top-level groups are: schema, source, cache, identity, content, structure, and signals. The economics group is optional. If you’re porting from a 1.0 implementation, see spec section 4.4 for the field-by-field rename map.
Per-group, in the order you’ll typically build them:
- schema: { name: "AIDocument", version: "2.0" } is the minimum. The optional ref is a content-addressed identifier you compute once over the non-volatile fields; agents can use it to detect unchanged pages across re-crawls without diffing the body.
- source.url: the URL after redirects (r.url in httpx; res.url in fetch).
- source.freshness_policy: the policy the caller requested (cache_first or force_refresh). If your implementation has no caching layer, default to "force_refresh".
- source.fetched_at: an RFC 3339 timestamp (e.g., 2026-05-13T12:34:56Z).
- source.render_mode: "static" when you served HTML straight off the wire; "rendered" if you ran a headless browser.
- cache: at minimum { status: "miss", origin_contacted: true, body_fetched: true } when serving from a fresh fetch. Implementations without a cache always emit "miss" or "refreshed".
- identity.title: prefer og:title, then the first h1, then <title>.
- identity.language: prefer html[lang]; otherwise detect.
- content.markdown: strip boilerplate (nav, footer, ads) and convert the main content. Tools like trafilatura (Python) or @mozilla/readability (Node) handle this well.
- structure.headings: walk the DOM, collect h1-h6 in document order with their levels.
- structure.links: collect <a href> elements; mark internal: true when the link host equals the page host.
- signals.has_json_ld: true if you found any <script type="application/ld+json"> blocks on the page.
- signals.heading_hierarchy_ok: true if there is at least one heading, the first is h1 or h2, and no adjacent levels jump by more than 1.
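The structure and signals bullets can be sketched with only the Python standard library. This is a rough illustration, not the spec's algorithm: the class and function names are hypothetical, the item keys "level", "text", and "href" are assumptions (only internal is named in this guide), and "jump by more than 1" is read here as a step to a deeper level that skips one, such as h2 followed by h4.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class StructureParser(HTMLParser):
    """Collects headings, links, and JSON-LD presence in one pass."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlsplit(page_url).hostname
        self.headings = []
        self.links = []
        self.has_json_ld = False
        self._open_heading = None  # heading currently being read, if any

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        script_type = (attr.get("type") or "").lower()
        if tag in HEADING_TAGS:
            self._open_heading = {"level": int(tag[1]), "text": ""}
        elif tag == "a" and attr.get("href"):
            # Resolve relative links against the page URL before comparing hosts.
            href = urljoin(self.page_url, attr["href"])
            internal = urlsplit(href).hostname == self.page_host
            self.links.append({"href": href, "internal": internal})
        elif tag == "script" and script_type == "application/ld+json":
            self.has_json_ld = True

    def handle_data(self, data):
        if self._open_heading is not None:
            self._open_heading["text"] += data

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._open_heading is not None:
            self._open_heading["text"] = self._open_heading["text"].strip()
            self.headings.append(self._open_heading)
            self._open_heading = None


def heading_hierarchy_ok(levels):
    """At least one heading, first is h1 or h2, no step deeper than one level."""
    if not levels or levels[0] > 2:
        return False
    return all(b - a <= 1 for a, b in zip(levels, levels[1:]))


def extract_structure(html, page_url):
    """Return (structure, signals) dicts for the fields named in this guide."""
    parser = StructureParser(page_url)
    parser.feed(html)
    parser.close()
    levels = [h["level"] for h in parser.headings]
    structure = {"headings": parser.headings, "links": parser.links}
    signals = {
        "has_json_ld": parser.has_json_ld,
        "heading_hierarchy_ok": heading_hierarchy_ok(levels),
    }
    return structure, signals
```

A production extractor would more likely reuse the DOM that trafilatura or a similar library has already built; the single-pass parser here only keeps the example dependency-free.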
See spec section 4.3 for a complete example response.
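For orientation, the seven required groups could be assembled roughly as follows for a fresh, uncached static fetch. This sketch includes only the fields named in this guide; the spec may require more inside each group, and the function name build_aidocument and its parameters are hypothetical.

```python
from datetime import datetime, timezone


def build_aidocument(final_url, title, language, markdown, structure, signals,
                     freshness_policy="force_refresh", render_mode="static"):
    """Assemble the seven required top-level groups (economics is omitted)."""
    # RFC 3339 timestamp in UTC, e.g. 2026-05-13T12:34:56Z
    fetched_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "schema": {"name": "AIDocument", "version": "2.0"},
        "source": {
            "url": final_url,  # the URL after redirects
            "freshness_policy": freshness_policy,
            "fetched_at": fetched_at,
            "render_mode": render_mode,
        },
        # No caching layer is assumed, so every response is a miss.
        "cache": {"status": "miss", "origin_contacted": True, "body_fetched": True},
        "identity": {"title": title, "language": language},
        "content": {"markdown": markdown},
        "structure": structure,
        "signals": signals,
    }


doc = build_aidocument(
    final_url="https://example.com/article",
    title="Example article",
    language="en",
    markdown="# Example article\n\nBody text.",
    structure={"headings": [], "links": []},
    signals={"has_json_ld": False, "heading_hierarchy_ok": False},
)
print(sorted(doc))
```

The result is a plain dict, so json.dumps(doc) yields the response body.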
4. Verify domain ownership
When a site owner registers their domain with your implementation, generate a verification token (at least 128 bits of entropy) and ask them to publish either:
DNS TXT (recommended)
_aiwebindex-verify.<their-domain> TXT "aiwi-verify=<token>"
Then resolve from a public resolver:
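# python (dnspython)
import dns.resolver

verified = False  # stays False unless a matching TXT record is found
answers = dns.resolver.resolve(
    f"_aiwebindex-verify.{domain}", "TXT"
)
for rdata in answers:
    # str(rdata) is the quoted presentation form of the TXT record
    if f'aiwi-verify={token}' in str(rdata):
        verified = True
.well-known file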
# their server returns:
GET https://<their-domain>/.well-known/aiwebindex-verify.txt
# response body:
aiwi-verify=<token>
Use HTTPS only. Reject plain-HTTP fetches even if they redirect to HTTPS later. Per spec 5.2, cross-domain redirects do not count.
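The token and the .well-known check might look like the following sketch. All three function names are hypothetical, the HTTPS request itself is left to the caller, and "cross-domain" is assumed to mean that the final hostname differs from the registered domain; spec 5.2 remains authoritative on the exact rule.

```python
import secrets
from urllib.parse import urlsplit


def new_verification_token() -> str:
    """16 random bytes (128 bits of entropy), rendered as 32 hex characters."""
    return secrets.token_hex(16)


def well_known_url(domain: str) -> str:
    """Always start from HTTPS; never try plain HTTP first."""
    return f"https://{domain}/.well-known/aiwebindex-verify.txt"


def well_known_verified(domain: str, token: str, final_url: str, body: str) -> bool:
    """Check a completed .well-known fetch.

    final_url is the URL after redirects and body is the response text.
    """
    final = urlsplit(final_url)
    if final.scheme != "https":
        return False
    # Assumption: a redirect that ends on another hostname is cross-domain.
    if (final.hostname or "").lower() != domain.lower():
        return False
    expected = f"aiwi-verify={token}"
    # Constant-time comparison of the trimmed body against the expected line.
    return secrets.compare_digest(body.strip().encode(), expected.encode())


token = new_verification_token()
print(well_known_url("example.com"))
print(well_known_verified(
    "example.com", token,
    "https://example.com/.well-known/aiwebindex-verify.txt",
    f"aiwi-verify={token}\n",
))
```

This inspects only the final URL. A stricter implementation would probably also walk the redirect chain (for example r.history in httpx) and reject any hop that leaves HTTPS or the registered domain.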
5. Rate-limit by origin
Track the last fetch time per canonical hostname, and do not fetch the same hostname again within 2 seconds of the last attempt; the example below waits out the remainder of the cooldown before fetching. A simple in-memory map keyed by hostname is enough for most implementations:
const lastFetch = new Map<string, number>();
async function fetchWithCooldown(url: URL) {
  const host = url.hostname;
  const last = lastFetch.get(host) ?? 0;
  const wait = 2000 - (Date.now() - last);
  if (wait > 0) await new Promise(r => setTimeout(r, wait));
  lastFetch.set(host, Date.now());
  return fetch(url, { headers: { "User-Agent": UA } });
}
Honor Crawl-delay in robots.txt where the value exceeds your default. Honor HTTP 429 Too Many Requests and 503 Service Unavailable as described in spec 6.3.
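One way to fold Crawl-delay and back-off into the same cooldown is sketched below in Python. The class name OriginCooldown is hypothetical, and treating a 429 or 503 as "wait at least the Retry-After seconds" is an assumption, since spec 6.3 is not reproduced here. The retry_after argument is expected to be already parsed into seconds.

```python
import time


class OriginCooldown:
    """Tracks the earliest time each hostname may be fetched again."""

    def __init__(self, default_delay=2.0, clock=time.monotonic):
        self.default_delay = default_delay
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self._next_allowed = {}

    def wait_needed(self, host):
        """Seconds the caller should sleep before contacting `host`."""
        return max(0.0, self._next_allowed.get(host, 0.0) - self.clock())

    def record_fetch(self, host, crawl_delay=None, retry_after=None):
        """Schedule the next allowed fetch after an attempt on `host`.

        crawl_delay: Crawl-delay from robots.txt in seconds, if present.
        retry_after: seconds to back off after a 429 or 503, if known
                     (assumed handling; see spec 6.3 for the actual rule).
        """
        # The longest applicable delay wins, so the default is a floor.
        delay = max(self.default_delay, crawl_delay or 0.0, retry_after or 0.0)
        self._next_allowed[host] = self.clock() + delay


cooldown = OriginCooldown()
time.sleep(cooldown.wait_needed("example.com"))  # no wait for an unseen host
cooldown.record_fetch("example.com", crawl_delay=5)
print(round(cooldown.wait_needed("example.com")))
```

Keeping the clock injectable means the cooldown arithmetic can be unit-tested with a fake clock instead of real sleeps.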
Conformance checklist
Before shipping, walk through this list. If you can answer “yes” to each item, your implementation conforms to AIWebIndex 2.0.
- Every protocol-driven fetch sends a User-Agent starting with AIWebIndex/2.0.
- robots.txt rules for User-agent: AIWebIndex are honored.
- Wildcard User-agent: * rules apply when no agent-specific block is present.
- The robots.txt Crawl-delay directive is respected when greater than the default per-origin cooldown.
- HTTP 429 and 503 responses pause fetches per spec section 6.3.
- AIDocument responses include all seven required top-level groups (schema, source, cache, identity, content, structure, signals) per spec section 4.1, and schema.version matches the version your implementation conforms to (currently 2.0).
- At least one of DNS TXT or .well-known verification is implemented end-to-end.
- Verification fetches use HTTPS only and reject cross-domain redirects.
- No User-Agent spoofing or bypass of authenticated content.
- No end-user identifiers leak into the AIDocument or the outbound request.
Reference implementation
The Lyrenth stack at www.lyrenth.com is the reference: a Go API, a Postgres-backed crawl queue, a chromedp-based renderer for SPA pages, and a Next.js dashboard for site owners. Source for the protocol-relevant bits is on GitHub. Use it to compare behavior, but implement against the spec, not against Lyrenth specifically.
Get listed once shipped
When your implementation is live, email hello@aiwebindex.org with a link and a one-paragraph summary. Conforming implementations are listed on /implementations at no charge.