AI Crawling

AI crawling is the process by which AI systems such as ChatGPT, Gemini, and Perplexity automatically fetch and read web pages to train models, build search indexes, and answer user questions in real time. Traditional search crawling indexes pages in order to rank them; in AI crawling, the collected content instead becomes raw material for training data or generated answers.

AI crawling is the process by which AI systems collect web content for training, indexing, and real-time responses; an AI crawler, by contrast, is the bot that does the collecting.
According to Cloudflare measurements, roughly 80% of AI crawling over the past year was for model training, with 18% for search indexing and 2% for real-time user actions.
Even within OpenAI alone, GPTBot (training), OAI-SearchBot (search), and ChatGPT-User (real time) serve different purposes, so each must be controlled separately in robots.txt.
robots.txt is about blocking access while llms.txt is about pointing AI to readable content; llms.txt by itself cannot prevent training use.
Crawling is surging while the referral visits it sends back remain minimal, so the crawl-to-refer imbalance has become a new concern for site operators.

What Is AI Crawling

AI crawling is the process by which generative AI systems automatically fetch and read web pages for their own purposes. Three distinct goals are bundled under the term. The first is training-data collection to build large language models, the second is indexing so that content can surface in AI search features, and the third is live fetch: retrieving supporting evidence at the moment a user asks a question. This article focuses on that process, its policies, and its traffic rather than on any particular bot.

One distinction matters. An AI crawler is the bot, the actor that actually sends requests, like GPTBot or ClaudeBot, whereas AI crawling is the behavior, policy, and flow those bots carry out. Because even bots from the same company split along training, search, and real-time purposes, operators are better served in practice by asking which purpose of crawling to allow or block, rather than which bot is showing up.

Traditional Crawling vs AI Crawling

A search engine's traditional crawling and AI crawling share the trait of reading the web automatically, but they differ fundamentally in where the collected data ends up being used.

Aspect	Traditional Crawling	AI Crawling
Primary purpose	Build a search index and compute rankings	Model training, AI search indexing, real-time answer generation
Use of what's collected	Surfaced as links on the search results page (SERP)	Consumed as training data, answer text, and citation evidence
Representative actors	Googlebot, Bingbot	GPTBot, ClaudeBot, PerplexityBot, Google-Extended (token)
Referral traffic	SERP clicks return traffic to the source site	The answer is delivered inside the AI interface, so clicks and visits tend to be few
Control mechanisms	robots.txt, meta robots, sitemaps	robots.txt (split per bot), llms.txt, training-only tokens (e.g., Google-Extended)
Operator concerns	Indexing, ranking, crawl budget	Consent to training use, crawl-to-refer imbalance, server load

How to Control AI Crawling

AI crawling can be controlled to a degree through standard files, but each file plays a clearly different role. robots.txt is access control: it tells crawlers where not to go. llms.txt is a guide: it points AI systems to content that has been organized for them to read easily.

robots.txt — Control Each Bot Separately

The key point is that since each AI bot has a different purpose, you should write separate rules per user-agent. OpenAI, for example, runs GPTBot for training, OAI-SearchBot for search, and ChatGPT-User for real time as distinct agents, so blocking GPTBot alone does not also keep you out of ChatGPT search.

# Example: block training crawls but allow AI search visibility

# OpenAI training: block everything
User-agent: GPTBot
Disallow: /

# OpenAI search indexing: allow (an empty Disallow permits every path)
User-agent: OAI-SearchBot
Disallow:

# Anthropic training/collection: block everything
User-agent: ClaudeBot
Disallow: /

# Google: leave Googlebot for search untouched and
# block only generative AI (Gemini, etc.) training use via this token
User-agent: Google-Extended
Disallow: /

# Perplexity: allow everything except the /private/ path
User-agent: PerplexityBot
Disallow: /private/

Google-Extended is not a separate bot but a control token used only in robots.txt. Setting this token to Disallow keeps your content from being used in generative AI training and grounding for Gemini and the like, while leaving Google Search indexing and rankings unaffected.
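
One way to sanity-check per-bot rules like the ones above before deploying them is to run them through a robots.txt parser. The following is a minimal sketch using Python's standard urllib.robotparser; the helper name check_access and the test URLs are assumptions made for illustration, and the result only shows how a standards-following parser reads the file, not how any particular crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Same rules as the example above, with comments removed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow:

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /private/
"""


def check_access(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each user-agent token, whether the rules allow fetching url.

    check_access is a hypothetical helper name, not part of any library.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Googlebot"]
    for url in ("https://example.com/docs/start", "https://example.com/private/report"):
        print(url)
        for agent, allowed in check_access(ROBOTS_TXT, agents, url).items():
            print(f"  {agent}: {'allowed' if allowed else 'blocked'}")
```

Under these rules the parser reports GPTBot and ClaudeBot as blocked everywhere, OAI-SearchBot and Googlebot as allowed, and PerplexityBot as blocked only under /private/, which matches the intent of blocking training while staying visible in search.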

llms.txt — Guide AI to Your Content

llms.txt is a Markdown file placed at the site root. It works more like a sitemap for models, helping AI efficiently find and read your core content without digging through menus, scripts, and layout.

# Example Corp

> A summary that points to our products, pricing, and documentation.

## Docs
- [Getting started](https://example.com/docs/start): installation and basic setup
- [API reference](https://example.com/docs/api): endpoint definitions

## Optional
- [About us](https://example.com/about)

A common misconception is that simply placing an llms.txt file lets you control training use. It does not. llms.txt is only a convenience that helps AI read content well at inference time; blocking training is the job of robots.txt or a training-only token. Keep in mind, too, that both robots.txt and llms.txt are voluntary conventions with no enforcement power, so they have no effect on crawlers that ignore them.
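
To make the "guide, not gate" point concrete, here is a minimal sketch of what a consumer could extract from a file shaped like the example above: section names and a list of links, and nothing else. The function name parse_llms_txt and the regular expression are assumptions for illustration; this is not an official parser, and it handles only the simple layout shown in this article.

```python
import re

# Matches "- [Title](url)" with an optional ": note" suffix.
LINK_RE = re.compile(
    r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<note>.*))?$"
)

LLMS_TXT = """\
# Example Corp

> A summary that points to our products, pricing, and documentation.

## Docs
- [Getting started](https://example.com/docs/start): installation and basic setup
- [API reference](https://example.com/docs/api): endpoint definitions

## Optional
- [About us](https://example.com/about)
"""


def parse_llms_txt(text: str) -> dict[str, list[dict[str, str]]]:
    """Group the links of an llms.txt-style file by their '## ' section."""
    sections: dict[str, list[dict[str, str]]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                sections[current].append(
                    {
                        "title": match["title"],
                        "url": match["url"],
                        "note": match["note"] or "",
                    }
                )
    return sections


if __name__ == "__main__":
    for section, links in parse_llms_txt(LLMS_TXT).items():
        print(section)
        for link in links:
            print(f"  {link['title']} -> {link['url']}")
```

Everything the file carries is a pointer to content; the format has no field that expresses permission or prohibition, which is why access policy has to live in robots.txt.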

Evidence and Statistics

The scale and nature of AI crawling come through fairly concretely in data Cloudflare has published based on its own network traffic.

Training dominates. By Cloudflare's analysis, roughly 80% of AI crawling over the past year was for model training, with 18% for search indexing and 2% for real-time user actions. The training share grew from 72% a year earlier to 80% (source: Cloudflare, The crawl-to-click gap).
AI crawling traffic rose quickly. Between May 2024 and May 2025, AI and search crawler traffic increased by about 18%, and over the same period OpenAI's GPTBot climbed from 2.2% to 7.7% of overall crawler share, rising from 9th to 3rd place (a +305% jump in request volume). ByteDance's Bytespider, by contrast, plunged from 42% to 7.2% (source: Cloudflare, From Googlebot to GPTBot).
Referrals are extremely low relative to crawling. Measured as crawl-to-refer (pages crawled per referred visit), Anthropic improved from 286,930:1 in January 2025 to 38,066:1 in July, but that still means tens of thousands of pages crawled for every single visit. Over the same span OpenAI went from 1,217:1 to 1,091:1, and Google, which had the smallest gap, went from 3.8:1 to 5.4:1 (source: Cloudflare, The crawl-to-click gap).

What these figures imply is clear. AI crawling sends very little traffic back to the source site relative to how much content it takes, so operators have reached a point where they must consciously decide, purpose by purpose, whether to allow training use, appear in AI search, or block both.
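
The crawl-to-refer ratio is simple arithmetic: pages crawled divided by referred visits. The sketch below shows that calculation and compares the January and July ratios quoted above. The function names are assumptions for illustration, and the raw counts in the first example are made-up numbers, not measurements.

```python
def crawl_to_refer(pages_crawled: int, referred_visits: int) -> float:
    """Pages crawled per referred visit (hypothetical helper)."""
    if referred_visits <= 0:
        raise ValueError("referred_visits must be positive")
    return pages_crawled / referred_visits


def change_factor(ratio_before: float, ratio_after: float) -> float:
    """How many times smaller the ratio became; values below 1 mean it grew."""
    return ratio_before / ratio_after


# Ratios quoted in this article (January 2025, July 2025).
QUOTED_RATIOS = {
    "Anthropic": (286_930, 38_066),
    "OpenAI": (1_217, 1_091),
    "Google": (3.8, 5.4),
}

if __name__ == "__main__":
    # Made-up counts for a single site, purely illustrative.
    print(f"{crawl_to_refer(120_000, 40):,.0f}:1")

    for operator, (january, july) in QUOTED_RATIOS.items():
        print(f"{operator}: ratio changed by a factor of {change_factor(january, july):.2f}")
```

Read this way, Anthropic's ratio shrank by a factor of roughly 7.5 yet remains in the tens of thousands, OpenAI's barely moved, and Google's factor is below 1 because its ratio rose over the period.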

Implementation Checklist

Start by examining your server and CDN logs to gauge the request volume and frequency of AI crawlers such as GPTBot, ClaudeBot, and PerplexityBot.
Decide separately on whether to allow training and whether to appear in AI search. They are two different policies.
Write rules per bot user-agent in robots.txt, and confirm that blocking GPTBot does not mean blocking OAI-SearchBot.
To keep Google Search but block only Gemini training, add a Disallow rule for the Google-Extended token in robots.txt.
Use llms.txt to point AI to the core content you want it to read, not to block training; keep blocking policy in robots.txt.
To guard against crawlers that ignore standard conventions, also consider blocking measures such as a WAF or bot management as needed.
After a policy change, allow time for it to take effect (for example, OpenAI search reflects a robots.txt change after about 24 hours) and then reconfirm the result in your logs.
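
The first checklist step, gauging AI crawler volume from logs, can be sketched as follows. This assumes access logs in the common "combined" format, where the user-agent is the last quoted field; the sample lines, IP addresses, and user-agent strings are made up for illustration. The purpose labels for the OpenAI agents and ClaudeBot follow this article, while labeling PerplexityBot as "search" is an assumption. User-agent strings can be spoofed, so treat the counts as a rough signal.

```python
import re
from collections import Counter

# Substring to look for in the user-agent -> crawl purpose.
BOT_PURPOSE = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "real-time",
    "ClaudeBot": "training",
    "PerplexityBot": "search",  # assumption: purpose not stated in this article
}

# In the combined log format the user-agent is the final quoted field.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')

# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jul/2025:10:00:00 +0000] "GET /docs/start HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"',
    '203.0.113.5 - - [01/Jul/2025:10:00:02 +0000] "GET /docs/api HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [01/Jul/2025:10:01:00 +0000] "GET / HTTP/1.1" 200 901 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)"',
    '198.51.100.9 - - [01/Jul/2025:10:02:00 +0000] "GET /about HTTP/1.1" 200 300 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0; +https://openai.com/bot)"',
    '192.0.2.44 - - [01/Jul/2025:10:03:00 +0000] "GET /docs/start HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"',
    '192.0.2.80 - - [01/Jul/2025:10:04:00 +0000] "GET / HTTP/1.1" 200 901 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]


def count_ai_requests(log_lines):
    """Count requests per known AI bot and per crawl purpose."""
    by_bot, by_purpose = Counter(), Counter()
    for line in log_lines:
        match = USER_AGENT_RE.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for bot, purpose in BOT_PURPOSE.items():
            if bot.lower() in user_agent:
                by_bot[bot] += 1
                by_purpose[purpose] += 1
                break  # count each request once
    return by_bot, by_purpose


if __name__ == "__main__":
    bots, purposes = count_ai_requests(SAMPLE_LOG)
    print("By bot:", dict(bots))
    print("By purpose:", dict(purposes))
```

Grouping by purpose rather than by bot name mirrors the decision the checklist asks for: whether to allow training and whether to appear in AI search are separate policies.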