AI Crawler

An AI crawler is an automated bot that collects web pages to train large language models (LLMs) or to generate AI search answers. Prominent examples include OpenAI's GPTBot, Anthropic's ClaudeBot, and Google-Extended, each identifiable and controllable through its own User-Agent and robots.txt token.

  • AI crawlers are bots that traverse the web to gather training data for LLMs and to fetch content for AI search answers, with GPTBot, ClaudeBot, and Google-Extended being the most prominent.
  • Most operators split their bots by purpose (training / search indexing / answering live user requests) and assign each a distinct User-Agent and robots.txt token.
  • By writing per-token Disallow rules in robots.txt, a site owner can apply selective control, such as blocking AI training while still allowing visibility in AI search.
  • Google-Extended is not a separate crawler but a robots.txt control token, and blocking it has no effect on Google Search rankings or indexing.
  • According to Cloudflare data (May 2025), GPTBot accounted for roughly 30% and ClaudeBot for roughly 21% of AI-specific crawler traffic, the two largest shares.

What Is an AI Crawler?

An AI crawler is a bot operated by a company running generative AI services to automatically collect the text and data on web pages. The data it gathers serves two main purposes. One is to train large language models (LLMs), such as the models behind ChatGPT, Claude, and Gemini. The other is to fetch up-to-date information in real time and cite it in answers, as ChatGPT search and similar AI answer features do. Whereas a traditional search crawler (for example, Googlebot) exists to build a search index, an AI crawler collects content for model training and for AI-generated answers.

It is worth being precise here: "AI crawling" refers to the act or process of collection, while an AI crawler refers to the bot that carries out that act. This article focuses on the bots themselves, that is, which operator runs which bot under what name and User-Agent and for what purpose. Identifying a bot accurately is what makes it possible to control it through robots.txt exactly as intended.

Most operators run their bots separated by purpose. For instance, they keep a training bot, a search-indexing bot, and a bot that fetches pages only when a user asks a question, assigning each a different robots.txt token. This structure lets a site owner set selective policies such as "do not use my content to train models, but do allow it to appear in AI search results."

The Major AI Crawlers at a Glance

The table below lists the major AI crawlers as documented in official sources, with their operator, robots.txt token, and primary purpose. Because operators update the version number in a User-Agent string (for example, GPTBot/1.3) from time to time, block and allow rules should be written against the robots.txt token name rather than the version number.

Bot Name | Operator | robots.txt Token | Primary Purpose
GPTBot | OpenAI | GPTBot | Collecting training data for generative AI foundation models
OAI-SearchBot | OpenAI | OAI-SearchBot | Surfacing sites in ChatGPT search results
ChatGPT-User | OpenAI | ChatGPT-User | Fetching a page directly at the moment a user asks a question
ClaudeBot | Anthropic | ClaudeBot | Collecting data to train and improve Claude models
Claude-User | Anthropic | Claude-User | Fetching a page needed to answer a user's question
Claude-SearchBot | Anthropic | Claude-SearchBot | Indexing content for Claude's search feature
Google-Extended | Google | Google-Extended | Controlling Gemini and Vertex AI training/grounding (not a separate crawler; token only)
CCBot | Common Crawl | CCBot | Building a public web archive (a source of training data for many LLMs)

For reference, the exact User-Agent strings are as follows (per official documentation):

  • GPTBot (OpenAI): Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.3; +https://openai.com/gptbot
  • CCBot (Common Crawl): CCBot/2.0 (https://commoncrawl.org/faq/)

By contrast, Google-Extended has no independent HTTP User-Agent string of its own. The actual crawling is performed under the existing Googlebot User-Agent, and the Google-Extended token is used solely for control within robots.txt.
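As a rough illustration of matching on the token name rather than the version number, the sketch below looks for a known token anywhere in a request's User-Agent header. The function name `identify_ai_crawler` and the purpose labels are assumptions made for this example, not part of any operator's documentation. Google-Extended is left out on purpose because it never appears in a User-Agent string.

```python
from typing import Optional

# Tokens taken from the table above. The purpose labels are shorthand
# chosen for this sketch, not official terminology.
AI_CRAWLER_TOKENS = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "user request",
    "ClaudeBot": "training",
    "Claude-User": "user request",
    "Claude-SearchBot": "search",
    "CCBot": "archive",
}


def identify_ai_crawler(user_agent: str) -> Optional[str]:
    """Return the first known token found in the User-Agent, or None."""
    ua = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        # Case-insensitive substring match, so "GPTBot/1.3" and a
        # future "GPTBot/2.0" both map to the same token.
        if token.lower() in ua:
            return token
    return None


gptbot_ua = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); "
    "compatible; GPTBot/1.3; +https://openai.com/gptbot"
)
token = identify_ai_crawler(gptbot_ua)
print(token, AI_CRAWLER_TOKENS.get(token))
```

A User-Agent header is self-reported and can be spoofed, so a match here only indicates what a request claims to be.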

Controlling AI Crawlers with robots.txt

The standard means of controlling AI crawlers is robots.txt. OpenAI, Anthropic, Google, and Common Crawl all officially state that their bots honor the standard directives in robots.txt. Below are examples you can apply selectively depending on your goals.

# The three snippets below are alternatives; pick the one that matches your goal.
# 1) Block all training AI crawlers (AI search visibility is separate)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# 2) Block training but allow ChatGPT search visibility
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

# 3) Only limit a specific bot's crawl rate
#    (Crawl-delay is a non-standard extension, so support varies by bot)
User-agent: ClaudeBot
Crawl-delay: 1

The key is to write the tokens correctly, because each token is matched independently. Blocking GPTBot, for example, has no effect on OAI-SearchBot, the bot behind ChatGPT search, so every bot you want to control must be listed under its own token for your intended policy to take effect. Likewise, Anthropic's ClaudeBot (training) and Claude-SearchBot (search) are kept separate.
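One way to sanity-check a policy before deploying it is to parse the file locally and ask what each token may fetch. The sketch below uses Python's standard `urllib.robotparser` on a policy equivalent to examples 2 and 3. The URL is a hypothetical placeholder, and Python's parser is only one interpretation of robots.txt, so a real crawler may resolve edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Policy under test: block training, allow ChatGPT search,
# and rate-limit ClaudeBot.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Crawl-delay: 1
"""


def load_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = load_policy(ROBOTS_TXT)
url = "https://example.com/articles/ai-crawler"  # hypothetical page

for token in ("GPTBot", "OAI-SearchBot", "ClaudeBot"):
    allowed = policy.can_fetch(token, url)
    delay = policy.crawl_delay(token)  # None when no Crawl-delay is set
    print(f"{token}: allowed={allowed}, crawl_delay={delay}")
```

With this policy the check reports GPTBot as blocked, OAI-SearchBot as allowed, and ClaudeBot as allowed with a crawl delay of 1.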

Prefer robots.txt Over IP-Based Blocking

Some operators (Anthropic, OpenAI, Common Crawl) publish the IP lists for their bots in JSON form (for example, Anthropic at claude.com/crawling/bots.json and Common Crawl at index.commoncrawl.org/ccbot.json). Anthropic, however, does not recommend IP-based blocking, because it can also block the bot's request to read robots.txt, which means your opt-out intent may not actually be honored. Anthropic and Common Crawl also warn that fake crawlers impersonating their bots exist, and advise verifying authenticity through the published IP lists or a reverse DNS lookup.
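The IP-list check can be sketched as a simple CIDR membership test. The example below assumes you have already downloaded an operator's published list and extracted its ranges into plain CIDR strings. The JSON layout differs between operators and is not modeled here, and the ranges shown are reserved documentation addresses used as placeholders, not real crawler IPs.

```python
import ipaddress
from typing import Iterable


def ip_in_published_ranges(ip: str, cidr_ranges: Iterable[str]) -> bool:
    """Return True if the address falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    for cidr in cidr_ranges:
        network = ipaddress.ip_network(cidr, strict=False)
        # Compare only like with like (IPv4 vs IPv4, IPv6 vs IPv6).
        if address.version == network.version and address in network:
            return True
    return False


# Placeholder ranges (reserved for documentation use), standing in for the
# ranges you would extract from an operator's published JSON file.
PLACEHOLDER_RANGES = ["192.0.2.0/24", "2001:db8::/32"]

for candidate in ("192.0.2.15", "198.51.100.7", "2001:db8::1"):
    print(candidate, ip_in_published_ranges(candidate, PLACEHOLDER_RANGES))
```

A check like this is suited to verifying that a request claiming to be a given bot is genuine. It is not meant as a blocking mechanism, for the reason given above.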

Evidence and Real-World Cases

The behavior and traffic share of AI crawlers are documented in the following official sources and statistics.

  • OpenAI official documentation: OpenAI runs separate crawlers including GPTBot (training), OAI-SearchBot (ChatGPT search visibility), ChatGPT-User (user requests), and OAI-AdsBot (ad page validation), and documents each bot's purpose and robots.txt token.
  • Anthropic support documentation: Anthropic operates three bots, ClaudeBot (training), Claude-User (user requests), and Claude-SearchBot (search indexing), and states that all of them respect the "do not crawl" signal in robots.txt and do not attempt to bypass access controls such as CAPTCHAs.
  • Google official documentation: Google states that blocking Google-Extended "does not affect inclusion in Google Search and is not used as a ranking signal." In other words, declining Gemini training and search visibility are controlled independently.
  • Cloudflare traffic analysis (May 2025): Across all crawlers, Googlebot held the largest share at roughly 50%, and among AI-specific crawlers, GPTBot accounted for roughly 30%, ClaudeBot for roughly 21%, and Meta-ExternalAgent for roughly 19%. GPTBot's raw request volume rose 305% year over year.

AI crawler traffic is thus growing fast, and the clear separation between training bots and search bots is an important takeaway for site owners. Depending on whether you want your content kept out of AI training or cited in AI search answers, you can design your robots.txt policy differently.
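To make the policy choice concrete, the sketch below generates robots.txt text from two yes/no decisions: whether to block training bots and whether to allow search bots. The function name `build_robots_txt` and the grouping of tokens into training and search lists are assumptions for this example, based on the purposes described in this article. Adjust the lists to the bots you actually care about.

```python
# Token groupings assumed for this sketch, following the purposes
# described in this article.
TRAINING_TOKENS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]
SEARCH_TOKENS = ["OAI-SearchBot", "Claude-SearchBot"]


def build_robots_txt(block_training: bool, allow_search: bool) -> str:
    """Build robots.txt groups for the chosen training and search policy."""
    groups = []
    if block_training:
        # One group per training token, each fully disallowed.
        groups += [f"User-agent: {t}\nDisallow: /" for t in TRAINING_TOKENS]
    # Search bots get an explicit rule either way, so the intent is visible.
    search_rule = "Allow: /" if allow_search else "Disallow: /"
    groups += [f"User-agent: {t}\n{search_rule}" for t in SEARCH_TOKENS]
    return "\n\n".join(groups) + "\n"


# Keep content out of training while staying visible in AI search.
print(build_robots_txt(block_training=True, allow_search=True))
```

The generated text covers only the AI-specific tokens, so it would be merged with whatever rules the site already has for other crawlers.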
