Specification

AIWebIndex 1.0

Draft. Published 2026-05-10. Editor: Aleksma AI Inc.

Status of this document §

This document is a draft of the AIWebIndex 1.0 specification, published by Aleksma AI Inc. as steward of the protocol. It describes a stable interface intended for adoption by AI systems, web crawlers, and site operators.

Implementations conforming to this specification are interoperable. Aleksma AI Inc. holds no patents on the protocol’s core mechanics and pledges not to seek such patents. Comments, questions, and proposed revisions may be sent to hello@aiwebindex.org.

Abstract §

AIWebIndex defines a small, focused protocol for AI-readable web indexing. It specifies (a) a User-Agent identifier crawlers MUST send when fetching content as part of an AIWebIndex workflow, (b) a JSON envelope, AIDocument, that interoperable implementations return, and (c) a verification mechanism that allows a site operator to prove ownership of a domain to a registrar of AIWebIndex implementations.

1. Introduction §

1.1 Background §

AI systems increasingly fetch web content on behalf of end users. They face two practical problems: (1) origin servers cannot tell them apart from generic browsers or search crawlers, which makes allowlisting, rate-limiting, and abuse handling unreliable; and (2) every implementation invents its own way of representing extracted content, fragmenting downstream tooling.

AIWebIndex addresses both: a single User-Agent token any compliant crawler sends, and a single JSON shape any compliant implementation returns. The protocol is intentionally minimal. It does not standardize ranking, retrieval, or storage. Those decisions remain with each implementation.

1.2 Goals §

Give site operators a single, stable identity to allowlist or block when an AI system reads their content.
Give consumers (AI agents, downstream tools) a single document shape they can rely on regardless of which AIWebIndex implementation produced it.
Give site owners a domain-verification mechanism that does not depend on any single implementation’s account system.
Stay implementable in a weekend. The protocol covers what is shared across implementations; everything else is out of scope.

1.3 Non-goals §

Ranking. AIWebIndex is silent on which pages an AI system should prefer; that is a product decision.
Storage. The protocol does not require persistent caching, an index, or any specific data lifetime.
Authentication. AIWebIndex does not standardize end-user auth between an AI system and an origin; per-implementation API keys and OAuth flows are out of scope.
Pricing. The protocol is free of fees by construction. Implementations may charge their own customers as they see fit.

2. Conformance §

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119, when, and only when, they appear in all capitals.

An implementation conforms to AIWebIndex 1.0 if it satisfies all the MUST-level requirements in sections 3, 4, and 5 of this document.

3. User-Agent identifier §

Conforming crawlers MUST send an HTTP User-Agent header that begins with the string AIWebIndex/1.0 when fetching content as part of a protocol-driven workflow.

The full User-Agent value SHOULD include a URL where site operators can read about the implementation. A conformant value looks like:

User-Agent: AIWebIndex/1.0 (+https://example.com/bot; <implementation-name>)

A separate verification fetch (Section 5) SHOULD use a User-Agent value with the suffix verification appended so site operators can distinguish on-demand crawls from ownership checks:

User-Agent: AIWebIndex/1.0 verification (+https://example.com/bot)

Implementations MUST NOT impersonate other User-Agent strings (browsers, search crawlers, generic libraries) in lieu of the AIWebIndex identifier when performing protocol-driven fetches.
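
The following non-normative Python sketch shows one way to assemble and recognize the two User-Agent shapes described in this section. The function names (build_user_agent, is_aiwebindex_agent, is_verification_fetch) are illustrative and not part of the protocol, and the prefix check is a literal reading of "begins with the string AIWebIndex/1.0".

```python
from typing import Optional

UA_PREFIX = "AIWebIndex/1.0"


def build_user_agent(info_url: str,
                     implementation: Optional[str] = None,
                     verification: bool = False) -> str:
    """Assemble a User-Agent value in the shape shown in Section 3."""
    token = UA_PREFIX + (" verification" if verification else "")
    details = "+" + info_url
    if implementation:
        details += "; " + implementation
    return f"{token} ({details})"


def is_aiwebindex_agent(value: str) -> bool:
    """Literal prefix check, as an origin server might apply it."""
    return value.startswith(UA_PREFIX)


def is_verification_fetch(value: str) -> bool:
    """True when the value carries the 'verification' suffix after the token."""
    return value.startswith(UA_PREFIX + " verification")
```

A site operator could use the two predicates to route ownership checks and on-demand crawls to different rate-limit buckets; that routing is a deployment choice, not something this specification requires.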

4. AIDocument format §

4.1 Top-level structure §

An AIDocument is a JSON object representing the structured extraction of a single URL. Conformant implementations MUST return objects with the following top-level fields when describing a fetched page:

{
  "url":          "string",          // the FINAL URL after redirects
  "canonical_url":"string",          // optional; rel="canonical" if present
  "title":        "string",
  "description":  "string",          // optional
  "markdown":     "string",          // cleaned content
  "headings":     [/* Heading[] */], // h1-h6 in document order
  "links":        [/* Link[] */],    // outbound links
  "images":       [/* Image[] */],   // optional
  "meta":         {/* MetaData */},
  "structured_data": {/* JSON-LD if present */},
  "crawl":        {/* CrawlInfo */}
}

Field names and JSON shapes are stable. Additive changes (new optional fields) are permitted within a major version; removals or renames require a new major version.

4.2 Field definitions §

url (string, required). The URL the document represents, after any HTTP redirects. MUST be a fully-qualified absolute URL.

canonical_url (string, optional). The value of the <link rel="canonical"> tag if present in the source page; otherwise omitted.

title (string, required). The page title, taken from og:title, the first h1, or the <title> tag, in that preference order.
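
The preference order for title can be sketched as a simple fallback chain. This is a non-normative illustration: pick_title is a hypothetical helper, and skipping blank candidates and trimming whitespace are assumptions this document does not state.

```python
from typing import Optional


def pick_title(og_title: Optional[str],
               first_h1: Optional[str],
               title_tag: Optional[str]) -> Optional[str]:
    """Return the first usable candidate in the order og:title, first h1, <title>."""
    for candidate in (og_title, first_h1, title_tag):
        # Assumption: empty or whitespace-only values count as absent.
        if candidate and candidate.strip():
            return candidate.strip()
    return None
```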

description (string, optional). Page description from meta[name=description] or og:description.

markdown (string, required). Cleaned, structured markdown of the main page content. Boilerplate (navigation, footers, ads) SHOULD be removed. Implementations MAY use any extraction algorithm.

headings (array, required). Headings in document order. Each entry has { level: number, text: string, id?: string }. Levels are 1-6 corresponding to h1-h6.

links (array, required). Outbound <a href> elements. Each entry has { url, text?, internal: boolean, rel? }. The internal field is true if the link target’s host equals the source page’s host.
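
A minimal, non-normative sketch of the host comparison behind the internal field follows. classify_link is a hypothetical helper. It assumes that relative hrefs are resolved against the page URL and that hosts are compared case-insensitively without the port; this document only says the hosts must be equal.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit


def classify_link(page_url: str, href: str, text: Optional[str] = None) -> dict:
    """Build a Link entry whose 'internal' flag compares hosts exactly."""
    absolute = urljoin(page_url, href)  # resolve relative hrefs (assumption)
    # urlsplit(...).hostname is lowercased and excludes any port.
    same_host = urlsplit(absolute).hostname == urlsplit(page_url).hostname
    entry = {"url": absolute, "internal": same_host}
    if text:
        entry["text"] = text
    return entry
```

Under this reading, www.example.com and example.com are different hosts, so a link between them is reported as external.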

images (array, optional). <img> elements with { url, alt? }.

meta (object, required). Derived metadata: at minimum language (BCP 47), word_count (integer), reading_time (minutes, integer). Implementations MAY include additional fields (author, site_name, keywords, og_image, published, modified).

structured_data (object, optional). JSON-LD blocks extracted from the source page, normalized to a single object whose keys are schema.org type names.

crawl (object, required). Information about how and when the document was fetched: fetched_at (RFC 3339 timestamp), status_code (HTTP integer), render_mode ("static" or "rendered"), fetch_duration_ms (integer), content_length (integer bytes), user_agent (string).
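
Consumers may find it useful to check an incoming AIDocument against the required fields above. The sketch below is non-normative: validate_aidocument and its error strings are illustrative, and treating every listed crawl sub-field as required is an assumption, since this section lists those sub-fields without marking each one as required.

```python
from typing import Any, Dict, List

REQUIRED_TOP_LEVEL = {"url": str, "title": str, "markdown": str,
                      "headings": list, "links": list, "meta": dict, "crawl": dict}
REQUIRED_META = {"language": str, "word_count": int, "reading_time": int}
# Assumption: all crawl sub-fields listed in Section 4.2 are required.
REQUIRED_CRAWL = {"fetched_at": str, "status_code": int, "render_mode": str,
                  "fetch_duration_ms": int, "content_length": int, "user_agent": str}
RENDER_MODES = ("static", "rendered")


def _check_fields(obj: Dict[str, Any], spec: Dict[str, type],
                  prefix: str, problems: List[str]) -> None:
    for name, expected in spec.items():
        if name not in obj:
            problems.append(f"missing {prefix}{name}")
            continue
        value = obj[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            problems.append(f"wrong type for {prefix}{name}")


def validate_aidocument(doc: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the required shape is present."""
    problems: List[str] = []
    _check_fields(doc, REQUIRED_TOP_LEVEL, "", problems)
    meta = doc.get("meta")
    if isinstance(meta, dict):
        _check_fields(meta, REQUIRED_META, "meta.", problems)
    crawl = doc.get("crawl")
    if isinstance(crawl, dict):
        _check_fields(crawl, REQUIRED_CRAWL, "crawl.", problems)
        if "render_mode" in crawl and crawl["render_mode"] not in RENDER_MODES:
            problems.append("crawl.render_mode must be 'static' or 'rendered'")
    return problems
```

Optional fields and unknown additional fields are deliberately ignored, which matches the rule that additive changes are permitted within a major version.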

4.3 Example §

{
  "url": "https://example.com/article",
  "canonical_url": "https://example.com/article",
  "title": "How HTTP works",
  "description": "A friendly introduction to HTTP request/response.",
  "markdown": "# How HTTP works\n\nWhen a client...",
  "headings": [
    { "level": 1, "text": "How HTTP works" },
    { "level": 2, "text": "Requests" }
  ],
  "links": [
    { "url": "https://www.rfc-editor.org/rfc/rfc7230",
      "text": "RFC 7230",
      "internal": false }
  ],
  "meta": {
    "language": "en",
    "word_count": 1240,
    "reading_time": 5
  },
  "crawl": {
    "fetched_at": "2026-05-10T12:34:56Z",
    "status_code": 200,
    "render_mode": "static",
    "fetch_duration_ms": 412,
    "content_length": 18402,
    "user_agent": "AIWebIndex/1.0 (+https://example.com/bot; example-impl)"
  }
}

5. Verification mechanism §

A site owner proves they control a domain by responding to a verification token. Conformant implementations MUST support at least one of the two methods below; implementations SHOULD support both.

5.1 DNS TXT record §

The implementation generates a verification token and instructs the owner to publish a DNS TXT record under _aiwebindex-verify.<domain> with the value aiwi-verify=<token>:

_aiwebindex-verify.example.com  TXT  "aiwi-verify=8a93c5f2..."

The implementation MUST query at least one authoritative DNS resolver for this record. The implementation SHOULD query multiple resolvers (e.g., 1.1.1.1, 8.8.8.8, 9.9.9.9) in parallel to tolerate misconfigured local resolvers.
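
The matching step can be sketched independently of the DNS lookup itself. In this non-normative example the TXT strings are assumed to have been retrieved already, by whatever DNS client the implementation uses. verification_record_name and txt_record_matches are hypothetical helpers, and stripping surrounding quotes and comparing in constant time are implementation choices rather than requirements of this document.

```python
import hmac
from typing import Iterable


def verification_record_name(domain: str) -> str:
    """Name of the TXT record the owner is asked to publish."""
    return "_aiwebindex-verify." + domain.strip().rstrip(".").lower()


def txt_record_matches(txt_values: Iterable[str], token: str) -> bool:
    """True if any retrieved TXT string equals aiwi-verify=<token>."""
    expected = f"aiwi-verify={token}".encode()
    for value in txt_values:
        # Some DNS clients return the value with its surrounding quotes.
        candidate = value.strip().strip('"').encode()
        if hmac.compare_digest(candidate, expected):
            return True
    return False
```

When several resolvers are queried in parallel, an implementation still has to decide how to combine their answers (any match, or agreement between resolvers). This document does not settle that policy.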

5.2 .well-known file §

The owner publishes a plain-text file at https://<domain>/.well-known/aiwebindex-verify.txt whose body contains aiwi-verify=<token>:

# https://example.com/.well-known/aiwebindex-verify.txt
aiwi-verify=8a93c5f2...

The implementation MUST fetch the file over HTTPS. Plain HTTP fetches MUST be rejected. The implementation SHOULD follow up to 3 redirects within the same registrable domain; cross-domain redirects MUST NOT count as a successful verification.
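
The redirect and body rules above can be sketched as two checks applied after the fetch. This is a non-normative illustration with hypothetical names. Determining a registrable domain properly requires the Public Suffix List, so the sketch takes that logic as a caller-supplied function. It also assumes the token must appear on a line of its own, which is one reading of "whose body contains".

```python
import hmac
from typing import Callable, Sequence
from urllib.parse import urlsplit

MAX_REDIRECTS = 3


def well_known_url(domain: str) -> str:
    return f"https://{domain}/.well-known/aiwebindex-verify.txt"


def chain_is_acceptable(chain: Sequence[str],
                        registrable_domain: Callable[[str], str]) -> bool:
    """chain lists every URL visited, starting with the well-known URL."""
    if not chain or len(chain) - 1 > MAX_REDIRECTS:
        return False
    parts = [urlsplit(u) for u in chain]
    # Every hop, including the first, has to be HTTPS with a host.
    if any(p.scheme != "https" or not p.hostname for p in parts):
        return False
    base = registrable_domain(parts[0].hostname)
    return all(registrable_domain(p.hostname) == base for p in parts)


def body_contains_token(body: str, token: str) -> bool:
    """True if some line of the file equals aiwi-verify=<token>."""
    expected = f"aiwi-verify={token}".encode()
    return any(hmac.compare_digest(line.strip().encode(), expected)
               for line in body.splitlines())
```

A verification would then succeed only when both checks pass for the same fetch.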

6. Crawler behavior §

6.1 robots.txt compliance §

Conforming crawlers MUST honor robots.txt directives addressed to AIWebIndex as the User-Agent. Wildcard rules (User-agent: *) MUST apply when no agent-specific block is present.
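
As a non-normative illustration, Python's standard urllib.robotparser module can evaluate both cases: an agent-specific block and the wildcard fallback. may_fetch is a hypothetical wrapper. The standard-library parser predates RFC 9309 and may differ from it in edge cases, so treat this as a sketch rather than a conformance reference.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, url: str, agent: str = "AIWebIndex") -> bool:
    """Evaluate an already-downloaded robots.txt body for the given agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# An agent-specific block takes precedence over the wildcard block.
ROBOTS_WITH_AGENT_BLOCK = """\
User-agent: AIWebIndex
Disallow: /private/

User-agent: *
Disallow: /
"""

# With no agent-specific block, the wildcard rules apply.
ROBOTS_WILDCARD_ONLY = """\
User-agent: *
Disallow: /tmp/
"""
```

With the first file, https://example.com/article is allowed for AIWebIndex and https://example.com/private/page is not; with the second, only paths under /tmp/ are refused.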

6.2 Rate limiting §

Implementations SHOULD enforce a per-origin cooldown between consecutive fetches of the same domain. A minimum default of 2 seconds is RECOMMENDED; implementations MAY raise this for sites that request a longer Crawl-delay in robots.txt.
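
One way to track the cooldown is a small in-memory table keyed by host, sketched below. This is non-normative: OriginCooldown and its methods are hypothetical, keying by hostname rather than by full origin is an assumption, and the clock is injectable only so the behavior can be tested without sleeping.

```python
import time
from typing import Callable, Dict, Optional
from urllib.parse import urlsplit


class OriginCooldown:
    """Tracks the earliest time each host may be fetched again."""

    def __init__(self, min_delay: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._min_delay = min_delay
        self._clock = clock
        self._next_allowed: Dict[Optional[str], float] = {}

    def seconds_until_allowed(self, url: str) -> float:
        """Remaining wait before url's host may be fetched; 0.0 if none."""
        host = urlsplit(url).hostname
        next_allowed = self._next_allowed.get(host)
        if next_allowed is None:
            return 0.0
        return max(0.0, next_allowed - self._clock())

    def record_fetch(self, url: str, crawl_delay: Optional[float] = None) -> None:
        """Call after each fetch; a longer Crawl-delay raises the cooldown."""
        host = urlsplit(url).hostname
        delay = max(self._min_delay, crawl_delay or 0.0)
        self._next_allowed[host] = self._clock() + delay
```

A crawler would sleep for seconds_until_allowed(url) before fetching and call record_fetch(url) afterwards, passing the site's Crawl-delay when robots.txt declares one.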

6.3 Backoff §

Implementations MUST respect HTTP 429 Too Many Requests responses by pausing fetches to that origin for at least the duration of the Retry-After header, or 60 seconds if absent. Implementations MUST respect 503 Service Unavailable the same way.
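
The pause duration can be derived from the response as in the following non-normative sketch. backoff_seconds is a hypothetical helper. Retry-After may carry either a number of seconds or an HTTP date; falling back to 60 seconds when the header is present but unparseable is an assumption, since this section only covers the absent case.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

DEFAULT_PAUSE = 60.0


def backoff_seconds(status_code: int,
                    retry_after: Optional[str],
                    now: Optional[datetime] = None) -> float:
    """Minimum pause, in seconds, before fetching from the origin again."""
    if status_code not in (429, 503):
        return 0.0
    if retry_after is None:
        return DEFAULT_PAUSE
    value = retry_after.strip()
    if value.isdigit():  # delay-seconds form
        return float(value)
    try:  # HTTP-date form
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return DEFAULT_PAUSE  # assumption: malformed is treated like absent
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())
```

Because the requirement is "at least" this duration, an implementation remains free to wait longer, for example by adding jitter or growing the pause on repeated failures.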

7. Security considerations §

Verification tokens. Implementations MUST generate verification tokens with at least 128 bits of entropy and SHOULD rotate them on every issuance. A token previously issued for a domain MUST NOT be reused for a different domain without explicit user action.
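
A minimal, non-normative sketch of token issuance follows. It uses Python's secrets module to draw 16 random bytes (128 bits) and issues a fresh token on every call. The hex encoding, the function names, and the in-memory dict standing in for a token store are all assumptions; this document does not define a token format.

```python
import secrets
from typing import Dict

TOKEN_BYTES = 16  # 16 bytes = 128 bits of entropy


def new_verification_token() -> str:
    """Fresh random token, hex-encoded (the encoding is an assumption)."""
    return secrets.token_hex(TOKEN_BYTES)


def issue_token(domain: str, store: Dict[str, str]) -> str:
    """Issue a new token for domain, replacing any earlier one (rotation)."""
    token = new_verification_token()
    # Tokens are keyed by domain, so one is never carried over to another domain.
    store[domain.lower()] = token
    return token


def verification_value(token: str) -> str:
    """The value the owner publishes via DNS TXT or the .well-known file."""
    return f"aiwi-verify={token}"
```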

HTTPS only for fetch. Verification fetches (Section 5.2) MUST use HTTPS. Crawl fetches SHOULD prefer HTTPS where the origin advertises it.

Origin spoofing. Implementations MUST NOT fabricate User-Agent strings to imitate browsers or search crawlers in order to bypass site-operator decisions.

Authenticated content. Implementations MUST NOT attempt to bypass paywalls, login walls, or technical access controls. The protocol applies only to publicly accessible content.

8. Privacy considerations §

End-user identity. The protocol carries no field for end-user identification. Implementations MUST NOT embed end-user identifiers (IP addresses, cookies, account IDs of the agent’s caller) into either the request or the AIDocument response.

PII in extracted content. Source pages may contain personally-identifying information published by their authors. The protocol does not require implementations to remove or transform such content; downstream consumers are responsible for compliance with applicable data-protection law.

Verification record visibility. DNS TXT records and HTTP files used for verification (Section 5) are public. Site owners SHOULD use opaque tokens, not human identifiers, in those records.

9. References §

RFC 2119: Key words for use in RFCs to indicate requirement levels. rfc-editor.org/rfc/rfc2119

RFC 3339: Date and Time on the Internet: Timestamps. rfc-editor.org/rfc/rfc3339

RFC 9309: Robots Exclusion Protocol. rfc-editor.org/rfc/rfc9309

BCP 47: Tags for Identifying Languages. rfc-editor.org/info/bcp47

schema.org: Structured-data vocabulary. schema.org

Version history §

1.0 (2026-05-10). Initial publication.
