AI.txt vs Llms.txt vs Robots.txt: The Complete AI Crawler Control Guide for 2026
Most articles about AI crawlers ask the wrong question -- whether to block them. The strategic question is which crawlers should be allowed to train on your content, which should retrieve from you for citation, and where licensing replaces both. This guide covers the three control files (robots.txt, llms.txt, AI.txt), the AI crawler taxonomy, the crawl-to-referral economics that should drive your decisions (ClaudeBot 130,330:1 vs PerplexityBot 194.8:1), the Perplexity stealth-crawling case study, and industry-specific decision frameworks. Every claim is verified against published sources.
Most articles about AI crawlers ask the wrong question. They ask whether you should “block AI” -- as if there were a single AI to block, a single decision to make, a single file that solves it. None of that is true. There are at least a dozen distinct AI crawlers from at least six vendors, divided into two functionally different categories, governed by three different control files, and the math of allow-vs-block produces wildly different answers depending on which crawler you're looking at.
We see this every week in our customer dashboards. A new brand signs up, asks “should I block GPTBot?” and the honest answer is “it depends on what GPTBot is actually costing you, what ChatGPT-User is bringing back, and whether you're a category where Anthropic-licensed training data is worth more than the citation traffic Claude would otherwise drive.” The answer is operational, not ideological.
Blocking is a distraction. The strategic question is which crawlers should be allowed to train on your content, which should retrieve from you for citation, and where licensing replaces both. This guide is the framework we use to walk customers through that decision.
What you'll get out of this post:
- A clean comparison of the three control files -- robots.txt, llms.txt, and AI.txt -- including an honest assessment of which ones actually work in 2026.
- A taxonomy of every major AI crawler with the strategic distinction nobody else elevates: training crawlers vs retrieval crawlers vs dual-purpose.
- The crawl-to-referral economics that should drive your decisions -- including the gap between ClaudeBot at 130,330:1 and PerplexityBot at 194.8:1.
- What the Perplexity stealth-crawling controversy (August 2025) teaches about enforcement.
- When licensing deals replace blocking entirely -- with the Reddit, FT, and News Corp deal economics.
- Sample configurations for publishers, SaaS, and e-commerce.
- A 12-question checklist you can audit your current setup against in an afternoon.
The Three Control Files at a Glance
Three files compete for the role of “tell AI what to do with my content.” They are not interchangeable. Each has a different purpose, a different adoption profile, and a wildly different track record on whether AI platforms actually read it.
robots.txt vs llms.txt vs AI.txt: Side-by-Side
Three files, three purposes, three very different adoption realities. robots.txt is the only one with universal honor-system compliance. llms.txt is the most-discussed but least-read.
| Attribute | robots.txt | llms.txt | AI.txt |
|---|---|---|---|
| Year introduced | 1994 | September 2024 | 2023 (Spawning) |
| Standardization | RFC 9309 (formal) | Proposed by Answer.AI | Spawning project; informal |
| Read by | All major AI crawlers (claim to) | No major AI platform reads it (yet) | Spawning's opt-out database; not respected by major AI vendors |
| Purpose | Disallow / allow URL paths per user agent | Curated map of which pages LLMs should prioritize | Express preferences about training-data use |
| Adoption (May 2026) | Near-universal on professional sites | ~10.13% of 300K surveyed domains | Single-digit percentage; mainly creative-industry sites |
| Enforcement | Voluntary; honor system | Voluntary signal; consumption is opt-in for AI vendors | Voluntary; some platforms support, most don't |
| When it matters | Today, for the bots that respect it | If/when AI platforms read it (not yet) | If/when training-data licensing becomes universal |
The honest summary: robots.txt is the only file with universal honor-system compliance. llms.txt is the most-discussed but least-read. AI.txt is a Spawning-led project for opting out of training data; it has even thinner adoption. We'll cover each in detail in sections 5-7.
The AI Crawler Taxonomy
Before we touch any control file, you need to know what you're trying to control. AI crawlers split into two functional categories, with one important dual-purpose outlier:
- Training crawlers fetch web content to train future AI models. They give you nothing back -- your content shapes a model that competes with you. GPTBot, ClaudeBot, CCBot, Bytespider, and Google-Extended fall here.
- Retrieval crawlers fetch web content so an AI platform can cite you in real-time answers. They potentially drive traffic and brand visibility. ChatGPT-User, Claude-User, Claude-SearchBot, OAI-SearchBot are pure retrieval bots.
- Dual-purpose bots do both. PerplexityBot is the canonical example -- it indexes for the answer engine but the same crawl feeds training as well. Google-Extended is technically training-only but its data path overlaps with the broader Google ecosystem.
For a complete glossary of crawler-related terms, see our 100-term AI citation dictionary.
AI Crawler Taxonomy: Training vs Retrieval
The strategic question isn't “block AI or not”; it's “which bots get my training data and which earn citation in return.” Each row below shows a bot, its purpose, and the JS-rendering / robots.txt-honoring nuances that affect your decision.
Training crawlers (block to keep your content out of model training)
GPTBot (OpenAI)
Training data collection for future GPT models
Block here to keep your content out of OpenAI training
ClaudeBot (Anthropic)
Training data collection for future Claude models
Block here to keep your content out of Claude training
Google-Extended (Google)
Training data for Gemini and AI Overviews; SEPARATE from Googlebot
Only AI crawler that renders JavaScript. Blocking does NOT remove you from Google Search itself
CCBot (Common Crawl)
Open web archive used as training data by many older LLMs
Indirect AI training feedstock; blocking removes you from many secondary LLMs that train on Common Crawl
Bytespider (ByteDance)
Training data for ByteDance/Doubao AI models
Was largest AI crawler in 2024; share collapsed after public outcry over scraping behavior
Retrieval crawlers (allow to be cited; block to be invisible)
ChatGPT-User (OpenAI)
User-triggered web fetches inside ChatGPT (browse mode)
Block here to remove yourself from ChatGPT's real-time fetches; affects user-initiated browsing
OAI-SearchBot (OpenAI)
Builds the search index ChatGPT Search uses
Block here to remove yourself from ChatGPT Search citations
Claude-User (Anthropic)
User-triggered web fetches inside Claude when a user asks for live content
Block here to remove yourself from Claude's user-initiated fetches
Claude-SearchBot (Anthropic)
Search infrastructure for Claude's web-search feature
Block here to remove yourself from Claude's web-search citations
Dual-purpose (the awkward middle)
PerplexityBot (Perplexity)
Indexing for Perplexity's answer engine; also feeds training
Cloudflare Aug 2025: caught using stealth crawlers to bypass robots.txt; the gold-standard cautionary tale
Two patterns to notice in that taxonomy:
First: only Google-Extended renders JavaScript. Every other major AI crawler -- including GPTBot, ClaudeBot, and PerplexityBot -- fetches your HTML and exits without executing JS. If your site is a SPA without server-side rendering, you are already invisible to four-fifths of the AI ecosystem before any robots.txt rule kicks in. We covered this in detail in what AI actually sees when it crawls your site.
Second: all major bots claim to honor robots.txt. The word “claim” is doing a lot of work there -- we'll show in section 8 why honor is voluntary, not enforced.
The Crawl-to-Referral Economics
The most important data point in this entire post is the ratio of pages each AI bot crawls per user it actually sends back to your site. Cloudflare Radar tracks these ratios across hundreds of thousands of sites and the spread between bots is enormous.
Pages Crawled per User Referral (Lower = More Efficient)
Cloudflare Radar tracked how many pages each major AI crawler fetches for every one user who ends up clicking through to the source. The gap between ClaudeBot and PerplexityBot is roughly 670x. This is the data that should drive any allow-or-block decision -- not vendor preference, not moral framing.
Source: Cloudflare Radar -- ClaudeBot ratio measured April 13-20, 2026; GPTBot and PerplexityBot ratios measured July 2025. Chart uses logarithmic scale because the spread is too wide for linear.
Read those numbers carefully. ClaudeBot fetches 130,330 pages for every 1 user it ends up sending to your site. GPTBot is roughly 120x more efficient at 1,091:1. PerplexityBot is the most efficient at 194.8:1 -- though that ratio rose sharply (256%) from January to July 2025, so the trend is in the wrong direction.
This is the data that should drive an allow-or-block decision. If you're a publisher with paid bandwidth and modest AI-citation traffic, ClaudeBot at 130,330:1 is essentially serving you a denial-of-service attack with no return. The economics are different for an e-commerce site with abundant bandwidth and high inventory-page volume, and different again for a SaaS company with low page count but premium content. We'll get to industry frameworks in section 11.
130,330:1
ClaudeBot pages crawled per user referral, July 2025 (Cloudflare Radar). Improved from 286,930:1 in January 2025 -- but still the most crawl-heavy major AI bot by a wide margin.
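If your CDN doesn't surface these ratios for you, you can approximate them from your own access logs. The sketch below assumes combined-format logs; the user-agent substrings and referrer-to-bot domain mappings are illustrative assumptions -- match them to what actually appears in your traffic:

```python
import re
from collections import Counter

# Hypothetical user-agent substrings and referrer domains; adjust to
# the bots and AI-platform referrers you actually see in your logs.
BOT_UAS = ["GPTBot", "ClaudeBot", "PerplexityBot"]
REFERRER_TO_BOT = {
    "chatgpt.com": "GPTBot",
    "claude.ai": "ClaudeBot",
    "perplexity.ai": "PerplexityBot",
}

def crawl_to_referral(log_lines):
    """Return {bot: (pages_crawled, referrals, ratio)} from combined-format logs."""
    crawls, referrals = Counter(), Counter()
    for line in log_lines:
        # Combined log format ends with: "referer" "user-agent"
        m = re.search(r'"([^"]*)" "([^"]*)"\s*$', line)
        if not m:
            continue
        referer, ua = m.groups()
        for bot in BOT_UAS:
            if bot in ua:
                crawls[bot] += 1       # a crawl by the declared bot
        for domain, bot in REFERRER_TO_BOT.items():
            if domain in referer:
                referrals[bot] += 1    # a human click-through from the AI platform
    return {
        bot: (crawls[bot], referrals[bot],
              crawls[bot] / referrals[bot] if referrals[bot] else float("inf"))
        for bot in BOT_UAS
    }
```

Run it over a month of logs and you get a per-bot ratio you can compare directly against the Cloudflare Radar figures above.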
robots.txt: The Universal But Voluntary Standard
robots.txt is the oldest of the three control files (1994, formalized as RFC 9309 in 2022) and the only one with anything resembling universal adoption. Every major AI crawler -- GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, Bytespider -- claims to read and respect it.
The syntax is simple. To block GPTBot from training on your entire site:
```
User-agent: GPTBot
Disallow: /
```
To block training but allow real-time retrieval (the strategic choice we'll defend in section 9):
```
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow retrieval crawlers (be cited in real-time AI answers)
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```
That configuration says: don't use my content to train future models, but do cite me when a user asks a real-time question. It's a coherent strategic position, not a contradiction.
The catch: robots.txt is voluntary. There is no protocol-level enforcement. A bot reads it because the vendor chooses to honor it. When a vendor decides not to -- or builds a parallel non-declared crawler -- there is no automatic recourse. That's section 8's case study. For now, treat robots.txt as the necessary baseline that works against cooperative bots and means nothing against uncooperative ones.
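Before worrying about uncooperative bots, it's worth confirming your rules say what you think they say to the cooperative ones. Python's standard-library `urllib.robotparser` evaluates a policy the way a compliant crawler would; here's a minimal sanity check of the block-training / allow-retrieval split (the example.com URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Parse a policy inline; in practice, point set_url() at your live
# /robots.txt and call read() instead.
rp = RobotFileParser()
rp.parse("""
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
""".splitlines())

# Training crawler blocked, retrieval crawler allowed -- the split holds.
assert not rp.can_fetch("GPTBot", "https://example.com/pricing")
assert rp.can_fetch("ChatGPT-User", "https://example.com/pricing")
```

Running this against every user agent in your file catches typos and precedence mistakes before a real crawler does.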
llms.txt: The Curation Signal That No Platform Reads (Yet)
llms.txt was proposed by Jeremy Howard at Answer.AI in September 2024. The pitch: instead of blocking, give LLMs a curated map of your most important content. Help them find what you'd want them to summarize.
The syntax is markdown. A typical /llms.txt looks like:
```
# Ranqo
> Track AI visibility across ChatGPT, Claude, Perplexity,
> Gemini, and Grok. Find which AI platforms cite your brand,
> what they say, and how that changes over time.

## Documentation
- [Getting Started](/docs/getting-started.md): Setup and first brand
- [Tracking Setup](/docs/tracking.md): Configure prompt sets and platforms
- [Source Tracking](/docs/sources.md): See which third-party sites get cited

## Blog (canonical references)
- [What is GEO?](/blog/what-is-generative-engine-optimization-geo-guide)
- [Schema for AI Citations](/blog/schema-markup-for-ai-citations)
- [AI Citation Dictionary](/blog/ai-citation-dictionary-100-terms)
```
That's the optimistic story. The honest story is harder. Adoption is climbing -- our analysis in the llms.txt complete guide put adoption at roughly 10% of surveyed domains as of early 2026, up from near-zero a year earlier. But the same analysis found no measurable correlation between llms.txt presence and AI-citation frequency. None of the major AI platforms have publicly committed to reading the file when generating answers. Google said explicitly that it relies on standard SEO signals, not on llms.txt.
The llms.txt ecosystem in 2026: roughly 10% adoption on the publishing side, 0% adoption on the consumption side. It's a hand waved for help in a room where nobody is looking yet. That may change. It hasn't yet.
What to do with that reality:
- Ship one if it's cheap. Generating an llms.txt from existing site structure is a 30-minute task. The cost is low and the option value if AI platforms ever read the file is real.
- Don't treat it as a visibility lever. No 2026 evidence supports the claim that llms.txt drives AI citations. Skeptical vendors who pitch “llms.txt SEO” are selling hype, not measurement.
- Watch the consumption side. If a major platform starts reading llms.txt, the calculus flips. Until then, treat it as forward-investment, not current-strategy.
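Since the advice is “ship one if it's cheap,” here is roughly what cheap looks like: a minimal generator that renders an llms.txt from a page inventory. The page data below is hypothetical; in practice you'd source it from your sitemap or CMS export:

```python
# Hypothetical page inventory: section -> [(title, path, description)].
pages = {
    "Documentation": [
        ("Getting Started", "/docs/getting-started.md", "Setup and first brand"),
    ],
    "Blog": [
        ("What is GEO?", "/blog/what-is-geo", "Intro to generative engine optimization"),
    ],
}

def build_llms_txt(site_name, summary, sections):
    """Render the llms.txt markdown format: H1, blockquote summary, H2 sections."""
    lines = [f"# {site_name}", f"> {summary}", ""]
    for heading, entries in sections.items():
        lines.append(f"## {heading}")
        for title, path, desc in entries:
            lines.append(f"- [{title}]({path}): {desc}")
        lines.append("")
    return "\n".join(lines)

print(build_llms_txt("Ranqo", "Track AI visibility across major AI platforms.", pages))
```

Wire it into your build pipeline and the file regenerates with the site, which also answers checklist question 7 (staleness) for free.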
AI.txt and Emerging Alternatives
AI.txt is the Spawning-led proposal that lets creators express preferences about whether their work can be used for AI training. It's separate from robots.txt (which controls crawl access) and llms.txt (which curates content for AI consumption). Where robots.txt says “you may or may not crawl,” AI.txt says “you may or may not train.”
The Spawning project also operates a “haveibeentrained.com” database that artists and creators can opt out of. Major image generators (Stability AI, Adobe Firefly) have integrated some of these preferences into their training pipelines. Major LLM vendors (OpenAI, Anthropic, Google) have not formally committed to AI.txt compliance.
Other emerging standards in the same space include C2PA (Coalition for Content Provenance and Authenticity, focused on cryptographic content signing), the IETF's evolving opt-out drafts, and various creator-focused registries. None has reached the universal-recognition status of robots.txt; most are creative-industry specific.
For most marketers and technical SEOs in 2026, AI.txt is a watch-this-space file rather than a deploy-tomorrow file. If you're an image-heavy brand or work with creator contracts, it matters more. If you're a SaaS or e-commerce site, it's strategically secondary to getting robots.txt right.
Case Study: Perplexity's Stealth-Crawling Controversy (Aug 2025)
On August 4, 2025, Cloudflare published a detailed investigation showing that Perplexity was running undeclared crawlers alongside its declared PerplexityBot user agent. The stealth crawlers used generic Chrome / macOS user-agent strings (indistinguishable from real users at first glance), rotated across multiple IP addresses and ASNs, and ignored robots.txt directives that explicitly disallowed Perplexity.
The numbers Cloudflare reported: 3-6 million daily stealth requests vs 20-25 million from declared bots -- meaning roughly 13-30% of Perplexity's effective crawl was happening outside the standard control mechanism. Cloudflare responded by de-listing PerplexityBot from its verified bot directory and rolling out network-level blocks for customers that opted in.
robots.txt compliance is voluntary. The Perplexity incident isn't an outlier; it's the predictable consequence of an honor system in a market where compliance has business costs. If you're relying on robots.txt alone for enforcement, you're relying on the goodwill of every AI vendor in perpetuity.
What the case taught the industry:
- Cloudflare-class WAF enforcement matters. A robots.txt rule plus network-level bot detection plus rate limiting is the minimum 2026 stack for sites that genuinely want compliance.
- Vendors caught violating face commercial consequences. Loss of Cloudflare verification, public credibility damage, and potential regulatory exposure are now real costs of stealth crawling.
- The detection bar is rising. Tools that previously trusted user-agent strings now fingerprint TLS handshakes, behavioral patterns, and IP-vs-claimed-vendor mismatches. Stealth crawling gets harder year over year.
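TLS fingerprinting belongs to your WAF vendor, but the oldest verification technique still catches the laziest impersonators: forward-confirmed reverse DNS. A sketch of the check, with the caveat that the hostname suffixes you pass in must come from each vendor's own published documentation -- nothing below is specific to any one bot:

```python
import socket

def forward_confirmed_rdns(ip, expected_suffixes):
    """Forward-confirmed reverse DNS check for a claimed crawler IP.

    1. Reverse-resolve the connecting IP to a hostname.
    2. Require the hostname to end with a vendor-published suffix.
    3. Forward-resolve that hostname and confirm it maps back to the IP.
    The suffixes are trust anchors and must come from vendor docs.
    """
    try:
        host, _, _ = socket.gethostbyaddr(ip)
        if not any(host.endswith(sfx) for sfx in expected_suffixes):
            return False
        _, _, addrs = socket.gethostbyname_ex(host)
        return ip in addrs
    except OSError:
        # No PTR record, NXDOMAIN, or resolver failure: treat as unverified.
        return False

# Example shape (Googlebot is the best-documented case of this pattern):
# forward_confirmed_rdns("66.249.66.1", (".googlebot.com", ".google.com"))
```

A request that claims a verified bot's user agent but fails this check is exactly the stealth-crawling pattern Cloudflare flagged.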
For Perplexity-specific optimization beyond crawler controls, see how to get cited by Perplexity.
The Strategic Distinction: Training vs Retrieval Crawlers
Here is the framing nobody else elevates as the central decision: a single AI vendor (OpenAI, Anthropic) operates multiple distinct crawlers, and the right move is to evaluate each one against what you get back, not against the vendor brand collectively.
OpenAI's three-bot architecture, for example:
- GPTBot -- training only. Trains future GPT models on your content. You get nothing back.
- ChatGPT-User -- retrieval only. Fetches your page when a ChatGPT user actively asks for live content. You get visibility, citations, and traffic.
- OAI-SearchBot -- retrieval index only. Builds the search index ChatGPT Search uses to cite sources. You get visibility in ChatGPT Search results.
Treating those as one vendor decision is a category error. Blocking GPTBot while allowing ChatGPT-User and OAI-SearchBot is a strictly dominant strategy for most brands: you keep your content out of training data and you stay visible in real-time AI answers. Anthropic has the same three-bot structure (ClaudeBot training; Claude-User and Claude-SearchBot retrieval) and the same logic applies. The same is true of Perplexity -- though there the stealth-crawling history complicates trust.
AI Crawler Market Share + YoY Growth (May 2025)
Cloudflare's annual crawler report measures share of all bot traffic. GPTBot tripled its share in a year and is now the third-largest crawler overall. ClaudeBot lost half its share despite Claude's growing chatbot adoption -- the contradiction is the point.
Source: Cloudflare May 2025 crawler report (annual comparison May 2024 to May 2025). Share is percentage of total bot traffic observed across the Cloudflare network.
The Cloudflare share chart above shows another piece of evidence for the training-vs-retrieval distinction: ClaudeBot lost 46% of its crawler share between May 2024 and May 2025 even as Claude's chatbot adoption was growing. Anthropic explicitly told users it had reduced training-crawl volume. The vendor itself pivoted toward retrieval-driven value rather than training-data-driven value. Sites that mirrored that pivot -- blocking training, allowing retrieval -- ended up aligned with the platform's own product direction.
The Licensing Alternative
For a small set of sites -- premium publishers, large content platforms, anyone whose content is structurally valuable as training data -- the right answer to “block or allow?” is “neither: license it.”
The deals shipped to date:
- Reddit-OpenAI (May 16, 2024) -- financial terms not publicly disclosed. First major AI training-data licensing deal. Established the model that every subsequent publisher deal followed.
- Financial Times-OpenAI (April 2024) -- exact terms not publicly disclosed; reported by industry press as a multi-year deal that integrates FT content into ChatGPT and licenses training data. Premium positioning for paywalled business news.
- News Corp-OpenAI (May 2024) -- reported at $250M+ over 5 years. Covers WSJ, Barron's, NY Post, the Telegraph. The largest publicly reported AI training-data deal.
- AP, Axel Springer, Vox, Le Monde -- multiple lower-disclosure deals following the same pattern.
The legal landscape that drives these deals: the New York Times v. OpenAI lawsuit is still active as of May 2026, with copyright infringement claims allowed forward and OpenAI compelled to produce ~20M ChatGPT user logs in discovery. The case has not been settled, which means the legal precedent for unlicensed training data is still being established. Publishers that sign licensing deals are buying certainty in a regulatory vacuum -- and AI vendors that sign them are buying insulation against future judgments.
For most sites, licensing isn't available -- the deals are reserved for premium publishers and platforms with structurally valuable training data. But the existence of these deals is the canary that proves the underlying value of your content. If your content is good enough that AI companies want to train on it, the right move may be to extract value from that interest rather than block it.
Industry-Specific Decision Frameworks
The optimal AI-crawler control strategy varies sharply by industry. What works for a B2B SaaS company is wrong for a news publisher; what works for an e-commerce site is wrong for a research firm. Three industry patterns:
News Publishers: Bot-by-Bot Blocking Rates
BuzzStream surveyed top US/UK news sites and counted which AI bots each one disallows in robots.txt. 79% block at least one training bot; 71% block at least one retrieval bot. Note the gap: Google-Extended is materially less blocked than ClaudeBot or GPTBot, because publishers are unwilling to risk Google search relationships.
Source: BuzzStream 2025 analysis of top US/UK news sites. Only 14% of surveyed publishers block all AI bots. The training-vs-retrieval gap is the strategic story.
News publishers
Publishers face the worst end of the crawl-economics gap: high content production costs, low marginal traffic value, and direct competition from AI summaries that replace clicks. The BuzzStream survey shows 79% of top news sites block at least one training bot. The honest framework: block training crawlers aggressively (or license them), permit retrieval crawlers selectively, and invest heavily in WAF-level enforcement against stealth crawling. If your content is premium enough to warrant a licensing deal, pursue it; otherwise, block hard and accept lower AI citation volume.
B2B SaaS
SaaS companies have the inverse calculus: low bandwidth cost per crawl (most pages are documentation or marketing), high value from AI citations (buyer-research queries are high-intent), and limited training-data risk (your documentation is unlikely to materially train competing models). Default: allow most retrieval crawlers, allow most training crawlers (with carve-outs for proprietary technical documentation), and treat AI visibility as a top-of-funnel channel. See our how to get cited by ChatGPT and Gemini guides for the optimization side.
E-commerce / DTC
E-commerce sits in the middle. Product pages are commodity content (low training-data risk), but inventory pages can burn bandwidth at scale, and the AI citation behavior dramatically favors marketplaces over DTC sites (the 9x retailer-citation gap between ChatGPT and Google AI Overviews we analyzed elsewhere). Default: allow retrieval crawlers, allow Google-Extended (you want to be in Google AI), block or rate-limit ClaudeBot due to the 130K:1 economics, allow GPTBot if bandwidth permits.
Sample robots.txt Configurations
Three reference configurations you can adapt. None are universally right -- pick the one that matches your industry framework above and modify from there.
Publisher: aggressive training block, selective retrieval allow
```
# Block all training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: Bytespider
User-agent: Amazonbot
User-agent: Applebot-Extended
Disallow: /

# Allow retrieval crawlers (be cited in real-time AI answers)
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
Allow: /

# PerplexityBot: blocked due to stealth-crawling history
User-agent: PerplexityBot
Disallow: /

# Default for all other crawlers
User-agent: *
Allow: /
```
B2B SaaS: permissive default, protect proprietary docs
```
# Allow all crawlers by default
User-agent: *
Allow: /

# Carve out proprietary technical documentation
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /internal-docs/
Disallow: /customer-only/

# Sitemap reference
Sitemap: https://yoursite.com/sitemap.xml
```
E-commerce: rate-limited training, full retrieval allow
```
# Block ClaudeBot due to crawl-economics gap (130,330:1 ratio)
User-agent: ClaudeBot
Disallow: /

# Allow Google-Extended (preserve Google AI inclusion)
User-agent: Google-Extended
Allow: /

# Allow all retrieval crawlers
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /

# Allow GPTBot but rate-limit at WAF layer (configure separately)
User-agent: GPTBot
Allow: /
Crawl-delay: 10

# Default
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
```
One note on syntax: Crawl-delay is honored inconsistently across crawlers. For real rate limiting on bots that ignore it, use Cloudflare or your CDN to enforce token-bucket limits. robots.txt is a request, not a rule.
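For reference, the token-bucket logic those CDN rate limits implement is simple enough to sketch: a bucket holds up to `capacity` tokens (the burst allowance), refills at `refill_rate` tokens per second (the sustained rate), and each request spends one token or gets rejected. This is a single-process illustration of the algorithm, not a production limiter:

```python
import time

class TokenBucket:
    """Token bucket: capacity = burst size, refill_rate = sustained req/s."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)   # start full: allow an initial burst
        self.last = time.monotonic()

    def allow(self):
        """Spend one token if available; False means serve a 429 instead."""
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per bot user agent; the numbers here are hypothetical --
# e.g. let GPTBot burst 5 requests, then sustain 0.1 req/s.
buckets = {"GPTBot": TokenBucket(capacity=5, refill_rate=0.1)}
```

Keying a bucket per verified bot identity (not per raw user-agent string) is what makes the limit meaningful against impersonators.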
Common Mistakes
We see these failure modes repeatedly across customer audits:
- Blocking GPTBot but forgetting OAI-SearchBot. You think you've removed yourself from ChatGPT, but ChatGPT Search continues to cite you because it uses a different crawler. The unintended outcome: you block training (good if intended) but stay visible in retrieval (probably good but should be a deliberate choice).
- Blocking Google-Extended and worrying about Google Search rankings. Google-Extended is for AI training only. Blocking it does not affect Googlebot, regular Google Search, or AI Overview retrieval (which uses Googlebot infrastructure). The risk is over-blocking out of caution.
- Treating llms.txt as a substitute for robots.txt. They solve different problems. llms.txt is a curation map for AI consumers (currently aspirational); robots.txt is access control (currently real). You may need both, but skipping robots.txt is the costlier omission.
- Relying on robots.txt alone for enforcement. Per the Perplexity stealth-crawling case study, an honor-system file is insufficient against vendors who choose not to honor it. Pair robots.txt with WAF/CDN-level bot detection if you genuinely need to enforce blocking.
- Setting allow/disallow for non-existent user agents. Bots that don't exist (typos, deprecated names, vendor-specific strings like “ChatGPT” that aren't real bots) clutter the file without doing anything. Use the verified taxonomy from section 3.
The AI Crawler Control Checklist
Audit your current setup against these 12 questions. Each should have a defensible answer; if any are “don't know” you have a gap to close.
- Does your robots.txt distinguish training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended) from retrieval crawlers (ChatGPT-User, Claude-User, OAI-SearchBot, Claude-SearchBot, PerplexityBot)?
- For each training crawler, can you articulate the business reason it's allowed or blocked? (“Default” isn't an answer.)
- For each retrieval crawler, are you allowing it? If not, why are you opting out of being cited?
- Is Google-Extended blocked? If yes, is that a deliberate choice to opt out of Google AI / Gemini training, or an accident from over-broad blocking?
- Do you have WAF-level or CDN-level bot detection in place for the bots you've blocked? Or are you trusting voluntary compliance?
- If you're a publisher, have you investigated whether a licensing deal is available for your scale of content?
- If you have an llms.txt, does it map to your most citation-worthy content, or is it stale?
- Are you server-rendering your most important content? (No major AI crawler except Google-Extended renders JS.)
- Have you tested what each major bot actually sees? (curl with the bot's user agent.)
- Do you log AI-bot traffic separately from regular traffic so you can compute crawl-to-referral ratios per bot?
- Have you reviewed your robots.txt within the last six months? AI crawler vendors add new bots and change behavior faster than annual reviews catch.
- Is your strategy reviewed against actual citation outcomes (visibility tracking) rather than just bandwidth metrics?
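Checklist items 8 and 9 are scriptable. The snippet below mirrors `curl -A '<user-agent>' <url>` in Python so you can diff what each bot receives against what a browser sees; the user-agent strings are representative assumptions, so confirm the current ones against each vendor's documentation before relying on them:

```python
import urllib.request

# Representative UA strings (assumptions -- verify against vendor docs).
BOT_UAS = {
    "GPTBot": "Mozilla/5.0; compatible; GPTBot/1.1; +https://openai.com/gptbot",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
}

def fetch_as(url, bot):
    """Fetch a URL presenting a bot's user agent, like curl -A.

    If the body is missing content a browser shows, the page likely
    depends on client-side JS rendering that most AI crawlers skip.
    """
    req = urllib.request.Request(url, headers={"User-Agent": BOT_UAS[bot]})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")

# Usage: status, html = fetch_as("https://yoursite.com/pricing", "GPTBot")
```

Note this tests rendering and access rules, not bot identity -- your server may also fingerprint and serve these UAs differently, which is itself worth knowing.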
The Honest Summary
AI crawler control in 2026 is a business decision dressed up as a technical one. The technical part -- the syntax of robots.txt, the structure of llms.txt, the choice of which bots to allow -- is the easy part. The hard part is being honest about what each crawler is actually costing you and what it's actually returning.
The framework that works:
- Treat training crawlers and retrieval crawlers as different decisions, not one vendor-level decision. Block training where the economics don't work; allow retrieval where the citations matter.
- Use the crawl-to-referral ratio as the primary numerical input. ClaudeBot at 130,330:1 is a different argument than PerplexityBot at 194.8:1.
- Don't rely on robots.txt alone for enforcement. Pair it with WAF-level detection if you genuinely need compliance.
- Treat llms.txt as forward-investment, not visibility lever. Ship one if cheap; don't expect citation lift in 2026.
- License where it's available and the math works. The Reddit, FT, and News Corp deals are the model -- if your content is valuable enough to interest AI vendors, extract value from that interest rather than just blocking it.
- Review quarterly. The crawler ecosystem moves faster than annual SEO reviews. Your robots.txt should evolve with it.
The blocking-vs-allowing debate is an artifact of treating AI crawlers as monolithic. They aren't. The brands that win on AI visibility in 2026-2027 are the ones who built crawler-by-crawler decisions, paired robots.txt with real enforcement, and stopped treating llms.txt as a magic file. Three control files. A taxonomy. The economics. That's the framework.
See which AI crawlers are actually citing your brand
Ranqo tracks AI visibility across ChatGPT, Claude, Perplexity, Gemini, and Grok -- so you can audit whether your robots.txt strategy is producing the citation outcomes you actually want. For broader context, also see the llms.txt complete guide and what AI actually sees when it crawls your site.
Written by
Nisha Kumari
Nisha Kumari is Co-Founder at Ranqo, where she leads growth strategy and client acquisition. With a background in digital marketing and financial management, she specializes in SEO, Generative Engine Optimization, and helping brands build visibility across AI platforms.