The AI Crawler Directory

A plain-English reference for the bots that read the web on behalf of AI systems. For each crawler you'll find who runs it, what it's used for, its exact user-agent token, and how to allow or block it in robots.txt.


Every crawler we test against

17 AI crawlers and control tokens. Select one for a full reference, including the robots.txt rules to allow or block it.

Model training

GPTBot

by OpenAI

OpenAI's training crawler. Allowing it lets your content be used to train future GPT models.

GPTBot

Search & answers

OAI-SearchBot

by OpenAI

OpenAI's search crawler. Controls whether your pages can surface and be cited in ChatGPT Search.

OAI-SearchBot

User-triggered fetch

ChatGPT-User

by OpenAI

Fetches a page in real time when a ChatGPT user (or a GPT/agent) follows a link or browses on demand.

ChatGPT-User

Model training

ClaudeBot

by Anthropic

Anthropic's training crawler for Claude. Allowing it lets your content inform future Claude models.

ClaudeBot

Search & answers

Claude-SearchBot

by Anthropic

Anthropic's search crawler. Controls whether your pages can surface and be cited when Claude searches the web.

Claude-SearchBot

User-triggered fetch

Claude-User

by Anthropic

Fetches a page in real time when a Claude user asks Claude to open or summarize a specific URL.

Claude-User

Search & answers

PerplexityBot

by Perplexity

Indexes pages so they can be cited in Perplexity's AI answers. A key source of AI referral traffic.

PerplexityBot

User-triggered fetch

Perplexity-User

by Perplexity

Fetches a page in real time when a Perplexity user's question requires visiting it directly.

Perplexity-User

Training opt-out token

Google-Extended

by Google

A robots.txt token — not a separate crawler. Controls AI training use without affecting Google Search.

Google-Extended

Search & answers

Googlebot

by Google

Google's main search crawler — and the bot whose index powers AI Overviews. Blocking it removes you from Google Search entirely.

Googlebot

Model training

CCBot

by Common Crawl

Common Crawl's bot. Its open dataset is one of the most widely used training corpora across the AI industry.

CCBot

Model training

Bytespider

by ByteDance

ByteDance's aggressive training crawler. Frequently blocked because of its heavy crawl volume and reports that it does not reliably honor robots.txt.

Bytespider

Search & answers

Amazonbot

by Amazon

Amazon's crawler, used to answer questions through Alexa and the Rufus shopping assistant.

Amazonbot

Training opt-out token

Applebot-Extended

by Apple

A robots.txt token that opts your content out of Apple's generative AI training, without blocking Siri/Spotlight.

Applebot-Extended

Model training

Meta-ExternalAgent

by Meta

Meta's crawler for training Meta AI and the Llama models. Honors robots.txt rules for its token.

meta-externalagent

Search & answers

DuckAssistBot

by DuckDuckGo

Powers DuckAssist, DuckDuckGo's AI answer feature. Privacy-focused and robots.txt compliant.

DuckAssistBot

Model training

cohere-ai

by Cohere

Cohere's crawler, gathering content to train its enterprise-focused language models.

cohere-ai

Reference

How to allow or block AI crawlers

AI crawlers are controlled the same way as any other bot: with robots.txt rules targeting each user-agent token. Add one rule group per bot you want to control, then a final catch-all group (User-agent: *) if needed. Most reputable AI bots honor these rules. A few (noted on their pages) do not, and need server- or firewall-level blocking.

To allow a specific crawler:

User-agent: GPTBot
Allow: /

To block a specific crawler:

User-agent: Bytespider
Disallow: /
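
For bots that ignore robots.txt, the rule above is not enough and the request has to be refused by the server itself. What follows is a minimal sketch of that idea as a Python WSGI wrapper, not a recommended production setup. The names is_blocked, block_ai_crawlers, and BLOCKED_TOKENS are hypothetical, and the token list is only an example. It assumes the crawler identifies itself honestly in its User-Agent header, which a misbehaving bot may not do.

```python
# Hypothetical helper names; the token list is an illustrative assumption.
BLOCKED_TOKENS = ("Bytespider",)


def is_blocked(user_agent, blocked_tokens=BLOCKED_TOKENS):
    """Return True if the User-Agent header contains any blocked token.

    Matching is a case-insensitive substring test, because crawlers embed
    their token inside a longer User-Agent string.
    """
    ua = (user_agent or "").lower()
    return any(token.lower() in ua for token in blocked_tokens)


def block_ai_crawlers(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app and answer 403 for requests from blocked crawlers."""

    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", ""), blocked_tokens):
            body = b"Forbidden"
            start_response(
                "403 Forbidden",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Everything else is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware
```

This approach only works for bots that send their own User-Agent. Google-Extended and Applebot-Extended are robots.txt tokens rather than separate crawlers, so they cannot be targeted this way.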

A common policy: opt out of AI training while staying visible in AI search and in traditional search engines (shown here for a subset of the tokens above):

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
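
One way to sanity-check a policy like this before deploying it is to parse it locally. The sketch below uses Python's standard urllib.robotparser. The check_policy helper and the example.com URL are assumptions made for illustration. The standard-library parser matches tokens more loosely than some real crawlers do, so treat the result as a rough check rather than a guarantee of how each bot will behave.

```python
from urllib.robotparser import RobotFileParser

# The policy from the text above, copied verbatim.
POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""


def check_policy(robots_txt, tokens, url="https://example.com/any-page"):
    """Return {token: allowed} for each user-agent token (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


if __name__ == "__main__":
    tokens = ["GPTBot", "CCBot", "Google-Extended",
              "OAI-SearchBot", "PerplexityBot", "Googlebot", "ClaudeBot"]
    for token, allowed in check_policy(POLICY, tokens).items():
        print(f"{token:16} {'allowed' if allowed else 'blocked'}")
```

With this policy, GPTBot, CCBot, and Google-Extended come back blocked and the rest allowed. Tokens with no group of their own, such as Googlebot and ClaudeBot here, are allowed by default, so any additional training crawler you want to exclude needs its own Disallow group.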

Note: Google-Extended and Applebot-Extended are training opt-out tokens — disallowing them does not remove you from Google Search, Siri, or Spotlight.
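
Because each token in the directory belongs to one category, a policy can also be generated from the categories instead of being written by hand. The following is a sketch under that assumption. The CRAWLERS mapping copies the tokens and categories listed on this page, while build_robots_txt and its parameters are hypothetical names introduced for illustration.

```python
# Token -> category, as listed in the directory on this page.
CRAWLERS = {
    "GPTBot": "Model training",
    "OAI-SearchBot": "Search & answers",
    "ChatGPT-User": "User-triggered fetch",
    "ClaudeBot": "Model training",
    "Claude-SearchBot": "Search & answers",
    "Claude-User": "User-triggered fetch",
    "PerplexityBot": "Search & answers",
    "Perplexity-User": "User-triggered fetch",
    "Google-Extended": "Training opt-out token",
    "Googlebot": "Search & answers",
    "CCBot": "Model training",
    "Bytespider": "Model training",
    "Amazonbot": "Search & answers",
    "Applebot-Extended": "Training opt-out token",
    "meta-externalagent": "Model training",
    "DuckAssistBot": "Search & answers",
    "cohere-ai": "Model training",
}

# Categories to disallow for the "no training, yes search" policy.
TRAINING_CATEGORIES = {"Model training", "Training opt-out token"}


def build_robots_txt(crawlers, blocked_categories):
    """Emit one robots.txt group per token (hypothetical helper).

    Tokens in a blocked category get "Disallow: /", all others "Allow: /".
    """
    groups = []
    for token, category in crawlers.items():
        rule = "Disallow: /" if category in blocked_categories else "Allow: /"
        groups.append(f"User-agent: {token}\n{rule}")
    # Groups are separated by a blank line, as in the examples above.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(CRAWLERS, TRAINING_CATEGORIES))
```

Disallowing the two opt-out tokens alongside the training crawlers leaves the Googlebot group as Allow, which matches the note above: the opt-out does not affect search visibility.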

Not sure which bots can actually read your page?

Run any URL through the AI Crawl Checker to see exactly what these crawlers receive — content, status codes, structured data, and robots.txt rules — in seconds.
