Implement AIWebIndex in a weekend.
This page is operational, not normative. For binding requirements read the specification; this guide covers how to actually build something conformant.
Overview
A minimum-viable AIWebIndex implementation needs five things:
- An HTTP client that sends the AIWebIndex User-Agent.
- A robots.txt parser that respects rules for AIWebIndex.
- An HTML extractor that produces an AIDocument shape.
- A verification flow (DNS TXT or .well-known) for site ownership.
- An origin rate-limit (default 2 seconds between hits to the same domain).
Everything else (storage, queuing, dashboards, billing) is your product surface. The protocol does not require any of it.
1. Send the User-Agent
Every protocol-driven fetch carries a stable User-Agent so site operators can allowlist (or block) you predictably. Use AIWebIndex/1.0 as the prefix and append your implementation’s identifier.
curl
curl -H 'User-Agent: AIWebIndex/1.0 (+https://example.com/bot; my-impl)' \
  'https://example.com/article'

Python (httpx)
import httpx
UA = "AIWebIndex/1.0 (+https://example.com/bot; my-impl)"
with httpx.Client(headers={"User-Agent": UA}, timeout=15.0) as client:
    r = client.get("https://example.com/article", follow_redirects=True)
    r.raise_for_status()
    html = r.text
    final_url = str(r.url)

Node (fetch)
const UA = "AIWebIndex/1.0 (+https://example.com/bot; my-impl)";
const res = await fetch("https://example.com/article", {
  redirect: "follow",
  headers: { "User-Agent": UA },
});
if (!res.ok) throw new Error(`upstream ${res.status}`);
const html = await res.text();
const finalUrl = res.url;

2. Honor robots.txt
Before fetching, check https://<domain>/robots.txt for rules addressed to your User-Agent token (User-agent: AIWebIndex) or the wildcard (User-agent: *). Most languages have a parser available; the one in the Python standard library (urllib.robotparser) is sufficient.
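The check above can be sketched with the standard library parser mentioned here (allowed is a hypothetical helper; it assumes you have already fetched the robots.txt body):

```python
from urllib.robotparser import RobotFileParser

def allowed(url: str, robots_txt: str, agent: str = "AIWebIndex") -> bool:
    # Parse an already-fetched robots.txt body; RobotFileParser falls
    # back to the "User-agent: *" group when no AIWebIndex group exists.
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

rules = "User-agent: AIWebIndex\nDisallow: /private/\n"
allowed("https://example.com/article", rules)    # True
allowed("https://example.com/private/x", rules)  # False
```

The same parser object also exposes rp.crawl_delay(agent), which is useful for the Crawl-delay handling discussed in step 5.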
A site that wants to block you publishes:
User-agent: AIWebIndex
Disallow: /

A site that wants to explicitly allow you publishes:
User-agent: AIWebIndex
Allow: /
User-agent: *
Disallow:

3. Extract an AIDocument
Take the fetched HTML and produce the JSON envelope described in spec section 4. The minimum required fields are: url, title, markdown, headings, links, meta, and crawl.
Field-by-field:
- url: the URL after redirects (r.url in httpx; res.url in fetch).
- title: prefer og:title, then the first h1, then <title>.
- markdown: strip boilerplate (nav, footer, ads) and convert the main content. Tools like trafilatura (Python) or @mozilla/readability (Node) handle this well.
- headings: walk the DOM, collect h1-h6 in document order with their levels.
- links: collect <a href> elements; mark internal: true when the link host equals the page host.
- meta.language: prefer html[lang]; otherwise detect.
- crawl.fetched_at: an RFC 3339 timestamp (e.g., 2026-05-10T12:34:56Z).
See spec section 4.3 for a complete example.
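Putting the fields together, a minimal assembly might look like this sketch (build_aidocument and the exact heading/link item shapes are illustrative assumptions; spec sections 4.1 and 4.3 define the binding shape):

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

def build_aidocument(url, title, markdown, headings, link_hrefs):
    # Hypothetical assembler covering the required envelope fields.
    host = urlparse(url).hostname
    return {
        "url": url,          # post-redirect URL
        "title": title,      # og:title > first h1 > <title>
        "markdown": markdown,
        "headings": headings,  # e.g. [{"level": 1, "text": "..."}]
        "links": [
            # internal when the link host equals the page host
            {"href": h, "internal": urlparse(h).hostname == host}
            for h in link_hrefs
        ],
        "meta": {"language": "en"},  # prefer html[lang] in practice
        "crawl": {
            # RFC 3339 timestamp in UTC
            "fetched_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        },
    }

doc = build_aidocument(
    "https://example.com/article", "Hello", "# Hello",
    [{"level": 1, "text": "Hello"}],
    ["https://example.com/about", "https://other.org/x"],
)
```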
4. Verify domain ownership
When a site owner registers their domain with your implementation, generate a verification token (at least 128 bits of entropy) and ask them to publish either:
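In Python, the secrets module covers the entropy requirement in one line (a sketch; any CSPRNG is fine):

```python
import secrets

# 16 random bytes = 128 bits of entropy, rendered as 32 hex characters
token = secrets.token_hex(16)
```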
DNS TXT (recommended)
_aiwebindex-verify.<their-domain> TXT "aiwi-verify=<token>"

Then resolve from a public resolver:
# python (dnspython)
import dns.resolver
answers = dns.resolver.resolve(
    f"_aiwebindex-verify.{domain}", "TXT"
)
for rdata in answers:
    if f"aiwi-verify={token}" in str(rdata):
        verified = True

.well-known file
# their server returns:
GET https://<their-domain>/.well-known/aiwebindex-verify.txt
# response body:
aiwi-verify=<token>

Use HTTPS only. Reject plain-HTTP fetches even if they redirect to HTTPS later. Per spec 5.2, cross-domain redirects do not count.
5. Rate-limit by origin
Track the last fetch time per canonical hostname; refuse to fetch the same hostname within 2 seconds of the last attempt. A simple in-memory map keyed by hostname is enough for most implementations:
const lastFetch = new Map<string, number>();
async function fetchWithCooldown(url: URL) {
  const host = url.hostname;
  const last = lastFetch.get(host) ?? 0;
  const wait = 2000 - (Date.now() - last);
  if (wait > 0) await new Promise(r => setTimeout(r, wait));
  lastFetch.set(host, Date.now());
  return fetch(url, { headers: { "User-Agent": UA } });
}

Honor Crawl-delay in robots.txt where the value exceeds your default. Honor HTTP 429 Too Many Requests and 503 Service Unavailable as described in spec 6.3.
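For the 429/503 handling, a small helper can decide how long to pause an origin (retry_after_seconds is hypothetical, and the precise back-off behavior is whatever spec 6.3 mandates; only the delta-seconds form of Retry-After is parsed here, with HTTP-date values falling back to a default):

```python
def retry_after_seconds(status: int, headers: dict, default: float = 60.0) -> float:
    # Pause the origin after a 429 or 503; other statuses need no pause.
    if status not in (429, 503):
        return 0.0
    value = headers.get("Retry-After", "")
    try:
        # Retry-After as delta-seconds, e.g. "120"
        return max(0.0, float(value))
    except ValueError:
        # Absent or HTTP-date form: fall back to a conservative default
        return default
```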
Conformance checklist
Before shipping, walk through this list. If you can answer “yes” to each item, your implementation conforms to AIWebIndex 1.0.
- Every protocol-driven fetch sends a User-Agent starting with AIWebIndex/1.0.
- robots.txt rules for User-agent: AIWebIndex are honored.
- Wildcard User-agent: * rules apply when no agent-specific block is present.
- The robots.txt Crawl-delay directive is respected when greater than the default per-origin cooldown.
- HTTP 429 and 503 responses pause fetches per spec section 6.3.
- AIDocument responses include all required fields per spec section 4.1.
- At least one of DNS TXT or .well-known verification is implemented end-to-end.
- Verification fetches use HTTPS only and reject cross-domain redirects.
- No User-Agent spoofing or bypass of authenticated content.
- No end-user identifiers leak into the AIDocument or the outbound request.
Reference implementation
The Lyrenth stack at www.lyrenth.com is the reference implementation: a Go API, a Postgres-backed crawl queue, a chromedp-based renderer for SPA pages, and a Next.js dashboard for site owners. Source for the protocol-relevant parts is on GitHub. Use it to compare behavior, but implement against the spec, not against Lyrenth specifically.
Get listed once shipped
When your implementation is live, email hello@aiwebindex.org with a link and a one-paragraph summary. Conforming implementations are listed on /implementations at no charge.