AIWebIndex 2.0
Status of this document §
This document is a draft of the AIWebIndex 2.0 specification, published by Aleksma AI Inc. as steward of the protocol. It describes a stable interface intended for adoption by AI systems, web crawlers, and site operators.
Implementations conforming to this specification are interoperable. Aleksma AI Inc. holds no patents on the protocol’s core mechanics and pledges not to seek such patents. Comments, questions, and proposed revisions may be sent to hello@aiwebindex.org.
Abstract §
AIWebIndex defines a small, focused protocol for AI-readable web indexing. It specifies (a) a User-Agent identifier crawlers MUST send when fetching content as part of an AIWebIndex workflow, (b) a JSON envelope, AIDocument, that interoperable implementations return, and (c) a verification mechanism that allows a site operator to prove ownership of a domain to a registrar of AIWebIndex implementations.
1. Introduction §
1.1 Background §
AI systems increasingly fetch web content on behalf of end users. They face two practical problems: (1) origin servers cannot tell them apart from generic browsers or search crawlers, which makes allowlisting, rate-limiting, and abuse handling unreliable; and (2) every implementation invents its own way of representing extracted content, fragmenting downstream tooling.
AIWebIndex addresses both: a single User-Agent token any compliant crawler sends, and a single JSON shape any compliant implementation returns. The protocol is intentionally minimal. It does not standardize ranking, retrieval, or storage. Those decisions remain with each implementation.
1.2 Goals §
- Give site operators a single, stable identity to allowlist or block when an AI system reads their content.
- Give consumers (AI agents, downstream tools) a single document shape they can rely on regardless of which AIWebIndex implementation produced it.
- Give site owners a domain-verification mechanism that does not depend on any single implementation’s account system.
- Stay implementable in a weekend. The protocol covers what is shared across implementations; everything else is out of scope.
1.3 Non-goals §
- Ranking. AIWebIndex is silent on which pages an AI system should prefer; that is a product decision.
- Storage. The protocol does not require persistent caching, an index, or any specific data lifetime.
- Authentication. AIWebIndex does not standardize end-user auth between an AI system and an origin; per-implementation API keys and OAuth flows are out of scope.
- Pricing. The protocol is free of fees by construction. Implementations may charge their own customers as they see fit.
2. Conformance §
The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119, when, and only when, they appear in all capitals.
An implementation conforms to AIWebIndex 2.0 if it satisfies all the MUST-level requirements in sections 3, 4, and 5 of this document.
3. User-Agent identifier §
Conforming crawlers MUST send an HTTP User-Agent header that begins with the string AIWebIndex/2.0 when fetching content as part of a protocol-driven workflow.
The full User-Agent value SHOULD include a URL where site operators can read about the implementation. A conformant value looks like:
User-Agent: AIWebIndex/2.0 (+https://example.com/bot; <implementation-name>)
A separate verification fetch (Section 5) SHOULD use a User-Agent value that adds the word verification directly after the AIWebIndex/2.0 token, so site operators can distinguish on-demand crawls from ownership checks:
User-Agent: AIWebIndex/2.0 verification (+https://example.com/bot)
Implementations MUST NOT impersonate other User-Agent strings (browsers, search crawlers, generic libraries) in lieu of the AIWebIndex identifier when performing protocol-driven fetches.
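The header shapes above can be assembled and recognized with a few lines of code. The following is a minimal sketch, not part of the specification: the function names `build_user_agent`, `is_aiwebindex_agent`, and `is_verification_fetch` are hypothetical, and the check that the token is followed by a space or the end of the string is an assumption made to avoid matching look-alike values such as `AIWebIndex/2.05`.

```python
PROTOCOL_TOKEN = "AIWebIndex/2.0"


def build_user_agent(info_url: str, implementation: str = "", verification: bool = False) -> str:
    """Assemble a User-Agent value in the shape shown in Section 3."""
    comment = "+" + info_url
    if implementation:
        comment += "; " + implementation
    marker = " verification" if verification else ""
    return f"{PROTOCOL_TOKEN}{marker} ({comment})"


def is_aiwebindex_agent(user_agent: str) -> bool:
    """Operator-side check: does the value begin with the protocol token?"""
    return user_agent == PROTOCOL_TOKEN or user_agent.startswith(PROTOCOL_TOKEN + " ")


def is_verification_fetch(user_agent: str) -> bool:
    """True for the ownership-check variant of the header."""
    return user_agent.startswith(PROTOCOL_TOKEN + " verification")
```

A site operator could use the two predicates to apply different rate limits to ordinary crawls and to ownership checks.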
4. AIDocument format §
4.1 Top-level structure §
An AIDocument is a JSON object representing the structured extraction of a single URL. Conformant 2.0 implementations MUST return objects with the following top-level groups when describing a fetched page:
{
"schema": { /* SchemaInfo */ }, // required
"source": { /* SourceInfo */ }, // required
"cache": { /* CacheInfo */ }, // required
"identity": { /* IdentityInfo */ }, // required
"content": { /* ContentInfo */ }, // required
"structure": { /* StructureInfo */ }, // required
"signals": { /* SignalsInfo */ }, // required
"economics": { /* EconomicsInfo */ } // optional
}
Top-level groups are a stable contract. Removing a group or renaming a key requires a new major version. Adding a new optional group (or a new optional field inside an existing group) within 2.x is permitted.
The grouped layout is the headline change between AIWebIndex 1.0 and 2.0. The motivation is that AI consumers can route on the block they care about (identity vs. structure vs. cost) without parsing the entire envelope, and the boundaries between “what we fetched” (source) and “what happened on this call” (cache) are explicit instead of implied. See Section 4.4 for a field-by-field migration map from 1.0.
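A consumer can check the envelope and route on individual blocks without touching the rest. The sketch below is illustrative only; `missing_groups` and `select_groups` are hypothetical helper names. Unknown groups are deliberately ignored rather than rejected, on the assumption that a 2.x revision may add optional groups.

```python
REQUIRED_GROUPS = ("schema", "source", "cache", "identity", "content", "structure", "signals")
OPTIONAL_GROUPS = ("economics",)


def missing_groups(doc: dict) -> list:
    """Names of required top-level groups that are absent or not JSON objects."""
    return [name for name in REQUIRED_GROUPS if not isinstance(doc.get(name), dict)]


def select_groups(doc: dict, wanted: tuple) -> dict:
    """Return only the groups a consumer cares about, skipping any that are absent."""
    return {name: doc[name] for name in wanted if name in doc}
```

For example, a cost dashboard might call `select_groups(doc, ("schema", "economics"))` and never read `content.markdown`.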
4.2 Field definitions §
schema (required)
Identifies the shape and version of the response so consumers can route by version.
- name (string, required). MUST be the literal "AIDocument".
- version (string, required). The spec major.minor version the response conforms to. Conformant 2.0 implementations emit "2.0"; conformant 2.x revisions emit the corresponding minor (e.g. "2.1").
- ref (string, optional). An opaque, content-addressed identifier of the document fingerprint, prefixed with aidoc:. The reference implementation uses aidoc:sha256:<32 hex chars> (a 128-bit prefix of a SHA-256 over the non-volatile fields: url, canonical_url, title, description, language, content_type, markdown, headings, links, images, structured_data). The identifier is stable across cache states and re-crawls of an unchanged page.
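One way to derive a ref of the shape described above is sketched here. The specification names the fields that feed the hash but not how they are serialized, so the canonical form used below (JSON with sorted keys and compact separators, UTF-8 encoded) is an assumption; values produced by this sketch should not be expected to match those of the reference implementation. The helper name `compute_ref` is hypothetical.

```python
import hashlib
import json

NON_VOLATILE_FIELDS = (
    "url", "canonical_url", "title", "description", "language", "content_type",
    "markdown", "headings", "links", "images", "structured_data",
)


def compute_ref(fields: dict) -> str:
    """Hash only the non-volatile fields and keep a 128-bit (32 hex char) prefix."""
    subset = {key: fields[key] for key in NON_VOLATILE_FIELDS if key in fields}
    # Assumed canonical form: sorted keys, no insignificant whitespace, UTF-8.
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "aidoc:sha256:" + digest[:32]
```

Because volatile values such as the fetch timestamp are excluded from the subset, two crawls of an unchanged page yield the same ref.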
source (required)
Describes what was fetched and how. Independent of cache state: even on a cache hit, these fields describe the underlying snapshot (which may be older than the current call).
- url (string, required). The URL the document represents, after any HTTP redirects. MUST be a fully-qualified absolute URL.
- canonical_url (string, optional). The value of the <link rel="canonical"> tag if present in the source page; otherwise omitted.
- fetched_at (string, optional). RFC 3339 timestamp of the underlying snapshot. Omitted on documents that predate the field.
- render_mode (string, optional). How the page was rendered. One of "static" (direct HTTP fetch), "rendered" (JS execution was required), or "static_after_render_failure" (renderer attempted but fell back to the static HTML).
- status_code (integer, optional). HTTP status from the origin at fetch time.
- freshness_policy (string, required). The policy the CALLER requested for this call. One of "cache_first" or "force_refresh". The outcome lives in cache.status.
cache (required)
Describes what happened on the implementation side for this specific call. The status string is a coarse-grained label; the two booleans are the precise truth.
- status (string, required). One of "hit" (cached snapshot served, origin not contacted), "miss" (cache_first policy, no cache hit, body fetched from origin), "refreshed" (force_refresh policy, body fetched from origin), or "stale_revalidated" (origin returned 304 Not Modified, no body).
- origin_contacted (boolean, required). True if the implementation contacted the origin during this call.
- body_fetched (boolean, required). True if a response body was fetched from the origin during this call.
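The relationship between the status label and the two booleans can be written down as a lookup table. The boolean pairs below are a reading of the parenthetical descriptions of each status value, not a table given by the specification, and the names `CACHE_OUTCOMES`, `cache_block`, and `is_consistent` are hypothetical.

```python
# status -> (origin_contacted, body_fetched), inferred from the status descriptions.
CACHE_OUTCOMES = {
    "hit": (False, False),               # cached snapshot served, origin not contacted
    "miss": (True, True),                # no cache hit, body fetched from origin
    "refreshed": (True, True),           # forced refresh, body fetched from origin
    "stale_revalidated": (True, False),  # origin answered 304, no body transferred
}


def cache_block(status: str) -> dict:
    """Build a cache group whose booleans agree with the status label."""
    origin_contacted, body_fetched = CACHE_OUTCOMES[status]
    return {"status": status, "origin_contacted": origin_contacted, "body_fetched": body_fetched}


def is_consistent(cache: dict) -> bool:
    """Consumer-side sanity check; unknown status values are reported as inconsistent."""
    expected = CACHE_OUTCOMES.get(cache.get("status"))
    return expected == (cache.get("origin_contacted"), cache.get("body_fetched"))
```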
identity (required)
Page-level identity metadata.
- title (string, optional). The page title, taken from og:title, the first h1, or the <title> tag, in that preference order.
- description (string, optional). Page description from meta[name=description] or og:description.
- language (string, optional). BCP 47 language tag (e.g. en, en-US, fr).
- content_type (string, optional). Implementation-classified page-content type (e.g. article, product, listing, profile).
content (required)
The cleaned page body.
- markdown (string, required). Cleaned, structured markdown of the main page content. Boilerplate (navigation, footers, ads) SHOULD be removed. Implementations MAY use any extraction algorithm.
structure (required)
Extracted page structure. Every field is optional; a single-paragraph article may legitimately have no links and no images.
- headings (array, optional). Headings in document order. Each entry has { level: number, text: string, id?: string }. Levels are 1-6 corresponding to h1-h6.
- links (array, optional). Outbound <a href> elements. Each entry has { url, text?, internal: boolean, rel? }. The internal field is true if the link target’s host equals the source page’s host.
- images (array, optional). <img> elements with { url, alt? }.
- structured_data (object, optional). Merged JSON-LD payload from <script type="application/ld+json"> blocks. Keys depend on the page and are not enumerated by this spec.
signals (required)
Per-call quality projection derived from the AIDocument structure. Cheap, deterministic, no implementation-side state lookup. The two booleans are the spec floor any conformant 2.0 implementation can compute; integer counts are RECOMMENDED but optional.
- word_count (integer, optional). Word count of the cleaned markdown.
- reading_time (integer, optional). Estimated reading time in minutes.
- has_json_ld (boolean, required). True if the page declared any application/ld+json structured data.
- heading_hierarchy_ok (boolean, required). True if there is at least one heading, the first is h1 or h2, and no adjacent levels jump by more than 1.
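The signals can be computed from the other groups alone. The sketch below makes three assumptions that the text does not settle: a "jump" is read as going deeper by more than one level (stepping back up from h3 to h1 is allowed), `has_json_ld` is approximated by a non-empty `structure.structured_data` object, and words are counted by splitting the markdown on whitespace, which also counts markup tokens such as `#`. The function names are hypothetical.

```python
def heading_hierarchy_ok(headings: list) -> bool:
    """At least one heading, first is h1 or h2, and no step deeper by more than one level."""
    levels = [heading["level"] for heading in headings]
    if not levels or levels[0] not in (1, 2):
        return False
    # Assumption: only downward skips (e.g. h1 -> h3) count as a jump.
    return all(nxt - cur <= 1 for cur, nxt in zip(levels, levels[1:]))


def signals_block(markdown: str, headings: list, structured_data: dict) -> dict:
    """Build a signals group from fields that already live in content and structure."""
    return {
        "word_count": len(markdown.split()),  # crude whitespace split
        "has_json_ld": bool(structured_data),
        "heading_hierarchy_ok": heading_hierarchy_ok(headings),
    }
```

An implementation that prefers the stricter reading could replace `nxt - cur` with `abs(nxt - cur)`.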
economics (optional)
Optional cost-savings projection for AI consumers: compares the tokens used by the cleaned markdown against the tokens an LLM would have consumed processing the raw HTML directly. Implementations MAY omit this group entirely; if present, all listed fields are required.
- output_tokens_approx (integer). Approximate token count of content.markdown under a generic LLM tokenizer.
- raw_html_tokens_approx (integer). Approximate token count of the raw HTML body.
- token_savings (integer). raw_html_tokens_approx - output_tokens_approx.
- token_savings_percent (number). (token_savings / raw_html_tokens_approx) * 100.
- estimated_cost_usd (object). Object with our_output, raw_html, and savings in USD, computed against the model described in pricing_basis.
- pricing_basis (object). Object with input_price_per_1k_usd (number), model_class (string, e.g. "mid-tier"), and optional note (string). Lets consumers recompute the math against their own model rates.
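The arithmetic behind this group is small enough to show in full. The sketch below follows the two formulas given for `token_savings` and `token_savings_percent`; the cost formula (tokens / 1000 × `input_price_per_1k_usd`) is inferred from the field name, and returning 0.0 percent when the raw HTML token count is zero is an assumption, since the text does not cover that case. `economics_block` is a hypothetical name and the token counts are taken as inputs rather than computed.

```python
def economics_block(output_tokens: int, raw_html_tokens: int,
                    input_price_per_1k_usd: float, model_class: str = "mid-tier") -> dict:
    """Build an economics group from two approximate token counts and a price."""
    savings = raw_html_tokens - output_tokens
    # Assumption: avoid dividing by zero for an empty raw HTML body.
    percent = (savings / raw_html_tokens) * 100 if raw_html_tokens else 0.0

    def cost(tokens: int) -> float:
        return tokens / 1000 * input_price_per_1k_usd

    return {
        "output_tokens_approx": output_tokens,
        "raw_html_tokens_approx": raw_html_tokens,
        "token_savings": savings,
        "token_savings_percent": percent,
        "estimated_cost_usd": {
            "our_output": cost(output_tokens),
            "raw_html": cost(raw_html_tokens),
            "savings": cost(raw_html_tokens) - cost(output_tokens),
        },
        "pricing_basis": {
            "input_price_per_1k_usd": input_price_per_1k_usd,
            "model_class": model_class,
        },
    }
```

A consumer with different model rates can call the same function with its own price to recompute the USD figures.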
4.3 Example §
{
"schema": {
"name": "AIDocument",
"version": "2.0",
"ref": "aidoc:sha256:8a93c5f24b1e7d0c3f9a8b2e6d4c1f0e"
},
"source": {
"url": "https://example.com/article",
"canonical_url": "https://example.com/article",
"fetched_at": "2026-05-13T12:34:56Z",
"render_mode": "static",
"status_code": 200,
"freshness_policy": "cache_first"
},
"cache": {
"status": "miss",
"origin_contacted": true,
"body_fetched": true
},
"identity": {
"title": "How HTTP works",
"description": "A friendly introduction to HTTP request/response.",
"language": "en",
"content_type": "article"
},
"content": {
"markdown": "# How HTTP works\n\nWhen a client..."
},
"structure": {
"headings": [
{ "level": 1, "text": "How HTTP works" },
{ "level": 2, "text": "Requests" }
],
"links": [
{ "url": "https://www.rfc-editor.org/rfc/rfc7230",
"text": "RFC 7230",
"internal": false }
]
},
"signals": {
"word_count": 1240,
"reading_time": 5,
"has_json_ld": false,
"heading_hierarchy_ok": true
}
}
4.4 Migration from 1.0 to 2.0 §
2.0 reshapes the AIDocument envelope: 1.0’s flat field set is grouped under semantic blocks (schema, source, cache, identity, content, structure, signals, optional economics). The rename map for every 1.0 field follows.
1.0 path                    2.0 path
--------------------------- ---------------------------
url                         source.url
canonical_url               source.canonical_url
title                       identity.title
description                 identity.description
markdown                    content.markdown
headings                    structure.headings
links                       structure.links
images                      structure.images
structured_data             structure.structured_data
meta.language               identity.language
meta.word_count             signals.word_count
meta.reading_time           signals.reading_time
crawl.fetched_at            source.fetched_at
crawl.status_code           source.status_code
crawl.render_mode           source.render_mode
crawl.fetch_duration_ms     (dropped; not portable)
crawl.content_length        (dropped; not portable)
crawl.user_agent            (dropped; redundant with the request UA)
Fields new in 2.0:
- schema block. Self-describing version identifier + content-addressed ref. Lets consumers route by version without inferring the shape.
- cache block. Per-call cache state (hit / miss / refreshed / stale_revalidated), explicitly separated from the snapshot fields under source.
- source.freshness_policy. The caller’s requested policy, distinct from the cache outcome.
- identity.content_type. Implementation-classified page type.
- signals.has_json_ld and signals.heading_hierarchy_ok. Boolean quality floors guaranteed on every response.
- economics block. Optional token + USD savings projection.
Implementer guidance. 1.0 implementations remain valid 1.0 implementations. There is no requirement to migrate. New implementations SHOULD target 2.0. Consumers reading AIDocument responses SHOULD discover the version via schema.version rather than path-presence detection, so future 2.x additions don’t break parsers. Mixed consumers that need to read both 1.0 and 2.0 responses MAY distinguish them by the presence of a top-level schema object (2.0+) vs. top-level url (1.0).
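A mixed consumer could apply the rename map and the version rule as follows. This is a sketch under stated assumptions: `detect_version` and `regroup_1_0_fields` are hypothetical names, and the regrouped result is not by itself a conformant 2.0 document, because the fields that are new in 2.0 (for example the schema and cache blocks) cannot be derived from a 1.0 response.

```python
RENAME_MAP = {
    "url": "source.url", "canonical_url": "source.canonical_url",
    "title": "identity.title", "description": "identity.description",
    "markdown": "content.markdown", "headings": "structure.headings",
    "links": "structure.links", "images": "structure.images",
    "structured_data": "structure.structured_data",
    "meta.language": "identity.language", "meta.word_count": "signals.word_count",
    "meta.reading_time": "signals.reading_time", "crawl.fetched_at": "source.fetched_at",
    "crawl.status_code": "source.status_code", "crawl.render_mode": "source.render_mode",
}
_MISSING = object()


def detect_version(doc: dict) -> str:
    """Prefer schema.version; fall back to the top-level url marker of 1.0."""
    schema = doc.get("schema")
    if isinstance(schema, dict) and "version" in schema:
        return schema["version"]
    if "url" in doc:
        return "1.0"
    raise ValueError("not an AIDocument")


def _lookup(doc: dict, path: str):
    node = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def regroup_1_0_fields(old: dict) -> dict:
    """Move 1.0 fields to their 2.0 paths; dropped crawl fields are simply not copied."""
    new = {}
    for old_path, new_path in RENAME_MAP.items():
        value = _lookup(old, old_path)
        if value is not _MISSING:
            group, field = new_path.split(".")
            new.setdefault(group, {})[field] = value
    return new
```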
5. Verification mechanism §
A site owner proves they control a domain by responding to a verification token. Conformant implementations MUST support at least one of the two methods below; implementations SHOULD support both.
5.1 DNS TXT record §
The implementation generates a verification token and instructs the owner to publish a DNS TXT record under _aiwebindex-verify.<domain> with the value aiwi-verify=<token>:
_aiwebindex-verify.example.com TXT "aiwi-verify=8a93c5f2..."
The implementation MUST query at least one authoritative DNS resolver for this record. The implementation SHOULD query multiple resolvers (e.g., 1.1.1.1, 8.8.8.8, 9.9.9.9) in parallel to tolerate misconfigured local resolvers.
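The record name, the expected value, and the comparison against resolver answers can be sketched without committing to a DNS library. The actual lookup is left out because the Python standard library has no TXT query API; the input `answers_by_resolver` (resolver address mapped to the TXT strings it returned) is a hypothetical shape. Treating the domain as verified when any one resolver returns the expected value is an assumption; the text does not say how disagreeing resolvers should be reconciled.

```python
RECORD_PREFIX = "_aiwebindex-verify"


def expected_record(domain: str, token: str) -> tuple:
    """Return the (record name, TXT value) pair the owner is asked to publish."""
    name = RECORD_PREFIX + "." + domain.strip().rstrip(".").lower()
    return name, "aiwi-verify=" + token


def txt_matches(txt_values: list, token: str) -> bool:
    """True if any TXT string equals the expected value, ignoring surrounding quotes."""
    expected = "aiwi-verify=" + token
    return any(value.strip().strip('"') == expected for value in txt_values)


def domain_verified(answers_by_resolver: dict, token: str) -> bool:
    """Assumption: a match from any single resolver is sufficient."""
    return any(txt_matches(values, token) for values in answers_by_resolver.values())
```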
5.2 .well-known file §
The owner publishes a plain-text file at https://<domain>/.well-known/aiwebindex-verify.txt whose body contains aiwi-verify=<token>:
# https://example.com/.well-known/aiwebindex-verify.txt
aiwi-verify=8a93c5f2...
The implementation MUST fetch the file over HTTPS. Plain HTTP fetches MUST be rejected. The implementation SHOULD follow up to 3 redirects within the same registrable domain; cross-domain redirects MUST NOT count as a successful verification.
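The redirect and body rules can be checked on a recorded redirect chain after the fetch. The sketch below makes two simplifications that are assumptions, not part of the specification: "same registrable domain" is approximated as "the verified domain or one of its subdomains" (a real implementation would consult the Public Suffix List), and the token is matched against whole lines of the body rather than arbitrary substrings. The function names and the `urls` input (the initial URL followed by each redirect target, in order) are hypothetical.

```python
from urllib.parse import urlsplit

MAX_REDIRECTS = 3


def same_site(host: str, domain: str) -> bool:
    """Naive stand-in for a registrable-domain comparison (no Public Suffix List)."""
    host = host.lower().rstrip(".")
    domain = domain.lower().rstrip(".")
    return host == domain or host.endswith("." + domain)


def redirect_chain_ok(domain: str, urls: list) -> bool:
    """Every hop must be HTTPS and stay on the domain; at most 3 redirects."""
    if not urls or len(urls) - 1 > MAX_REDIRECTS:
        return False
    for url in urls:
        parts = urlsplit(url)
        if parts.scheme != "https" or not same_site(parts.hostname or "", domain):
            return False
    return True


def body_verifies(body: str, token: str) -> bool:
    """True if some line of the file is exactly aiwi-verify=<token>."""
    expected = "aiwi-verify=" + token
    return any(line.strip() == expected for line in body.splitlines())
```

The naive comparison is conservative: it can reject a redirect that a Public Suffix List lookup would accept, but it does not accept look-alike hosts such as `evilexample.com`.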
6. Crawler behavior §
6.1 robots.txt compliance §
Conforming crawlers MUST honor robots.txt directives addressed to AIWebIndex as the User-Agent. Wildcard rules (User-agent: *) MUST apply when no agent-specific block is present.
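Python's standard `urllib.robotparser` already implements the agent-specific-then-wildcard lookup described above, so a crawler can use it as a first approximation. This is a sketch, not a statement that the parser implements every detail of RFC 9309; `is_allowed` is a hypothetical wrapper, and the robots.txt body is passed in as text so that fetching it stays a separate concern.

```python
from urllib.robotparser import RobotFileParser

AGENT_NAME = "AIWebIndex"


def is_allowed(robots_txt: str, url: str, agent: str = AGENT_NAME) -> bool:
    """Evaluate a URL against a robots.txt body for the AIWebIndex agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


AGENT_SPECIFIC = """\
User-agent: AIWebIndex
Disallow: /private/

User-agent: *
Disallow: /
"""

WILDCARD_ONLY = """\
User-agent: *
Disallow: /drafts/
"""
```

With `AGENT_SPECIFIC`, the AIWebIndex block takes precedence over the wildcard block, so only `/private/` is off limits; with `WILDCARD_ONLY`, the wildcard rules apply because no agent-specific block exists.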
6.2 Rate limiting §
Implementations SHOULD enforce a per-origin cooldown between consecutive fetches of the same domain. 2 seconds is RECOMMENDED as the minimum default; implementations MAY raise this for sites that request a longer Crawl-delay in robots.txt.
6.3 Backoff §
Implementations MUST respect HTTP 429 Too Many Requests responses by pausing fetches to that origin for at least the duration of the Retry-After header, or 60 seconds if absent. Implementations MUST respect 503 Service Unavailable the same way.
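The cooldown and backoff rules reduce to two small calculations. In the sketch below, `cooldown_seconds` and `pause_seconds` are hypothetical names. Retry-After is parsed in both forms HTTP allows (delay in seconds or an HTTP date); treating an unparseable header as absent is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

MIN_COOLDOWN_SECONDS = 2.0
DEFAULT_PAUSE_SECONDS = 60.0


def cooldown_seconds(crawl_delay: Optional[float]) -> float:
    """Per-origin cooldown: the 2 second default, raised by a longer Crawl-delay."""
    return max(MIN_COOLDOWN_SECONDS, crawl_delay or 0.0)


def pause_seconds(status_code: int, retry_after: Optional[str],
                  now: Optional[datetime] = None) -> float:
    """Minimum pause for an origin after a response; 0.0 if no backoff is required."""
    if status_code not in (429, 503):
        return 0.0
    if retry_after is None:
        return DEFAULT_PAUSE_SECONDS
    value = retry_after.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return DEFAULT_PAUSE_SECONDS  # assumption: unparseable header treated as absent
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())
```

The `now` parameter exists only to make the date branch testable with a fixed clock.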
7. Security considerations §
Verification tokens. Implementations MUST generate verification tokens with at least 128 bits of entropy and SHOULD rotate them on every issuance. A token previously issued for a domain MUST NOT be reused for a different domain without explicit user action.
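A token that meets the entropy floor can be drawn from the operating system's CSPRNG. The sketch below is one possible shape: `new_verification_token` and `TokenRegistry` are hypothetical names, the hex encoding is a choice rather than a requirement, and the in-memory dictionary stands in for whatever storage an implementation uses.

```python
import secrets


def new_verification_token() -> str:
    """16 random bytes from the OS CSPRNG: 128 bits, rendered as 32 hex characters."""
    return secrets.token_hex(16)


class TokenRegistry:
    """Binds each issued token to exactly one domain."""

    def __init__(self) -> None:
        self._by_domain = {}

    def issue(self, domain: str) -> str:
        # A fresh token on every issuance; the previous one for this domain is replaced.
        token = new_verification_token()
        self._by_domain[domain] = token
        return token

    def matches(self, domain: str, token: str) -> bool:
        expected = self._by_domain.get(domain)
        if expected is None:
            return False
        return secrets.compare_digest(expected.encode(), token.encode())
```

Because lookups are keyed by domain, a token issued for one domain never validates another.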
HTTPS only for fetch. Verification fetches (Section 5.2) MUST use HTTPS. Crawl fetches SHOULD prefer HTTPS where the origin advertises it.
Origin spoofing. Implementations MUST NOT fabricate User-Agent strings to imitate browsers or search crawlers in order to bypass site-operator decisions.
Authenticated content. Implementations MUST NOT attempt to bypass paywalls, login walls, or technical access controls. The protocol applies only to publicly accessible content.
8. Privacy considerations §
End-user identity. The protocol carries no field for end-user identification. Implementations MUST NOT embed end-user identifiers (IP addresses, cookies, account IDs of the agent’s caller) into either the request or the AIDocument response.
PII in extracted content. Source pages may contain personally-identifying information published by their authors. The protocol does not require implementations to remove or transform such content; downstream consumers are responsible for compliance with applicable data-protection law.
Verification record visibility. DNS TXT records and HTTP files used for verification (Section 5) are public. Site owners SHOULD use opaque tokens, not human identifiers, in those records.
9. References §
RFC 2119: Key words for use in RFCs to indicate requirement levels. rfc-editor.org/rfc/rfc2119
RFC 3339: Date and Time on the Internet: Timestamps. rfc-editor.org/rfc/rfc3339
RFC 9309: Robots Exclusion Protocol. rfc-editor.org/rfc/rfc9309
BCP 47: Tags for Identifying Languages. rfc-editor.org/info/bcp47
schema.org: Structured-data vocabulary. schema.org
Version history §
2.0 (2026-05-13). Breaking change to Section 4 (AIDocument format). The flat field set from 1.0 is regrouped under semantic blocks: schema, source, cache, identity, content, structure, signals, optional economics. schema introduces a self-describing version identifier and a content-addressed ref; cache separates per-call cache state from snapshot metadata; signals guarantees two boolean quality floors (has_json_ld, heading_hierarchy_ok) on every response; economics exposes an optional token + USD savings projection. Three 1.0 crawl fields are dropped (fetch_duration_ms, content_length, user_agent) on the grounds that they are not portable across implementations. Section 4.4 documents the 1.0 → 2.0 migration field-by-field. Sections 1, 2, 3, 5, 6, 7, 8, and 9 are unchanged from 1.0.
1.0 (2026-05-10). Initial publication. Flat AIDocument envelope (url, title, markdown, headings, links, meta, crawl). Remains a valid implementation target for systems already deployed against it; new implementations should target 2.0.