Sona

The AI Crawler Directory

A plain-English reference for the bots that read the web on behalf of AI systems. For each crawler you'll find who runs it, what it's used for, its exact user-agent token, and how to allow or block it in robots.txt.

Every crawler we test against

14 AI crawlers and control tokens. Select one for a full reference, including the robots.txt rules to allow or block it.

Model training

GPTBot

by OpenAI

OpenAI's training crawler. Allowing it lets your content be used to train future GPT models.

User-agent token: GPTBot

Search & answers

OAI-SearchBot

by OpenAI

OpenAI's search crawler. Controls whether your pages can surface and be cited in ChatGPT Search.

User-agent token: OAI-SearchBot

User-triggered fetch

ChatGPT-User

by OpenAI

Fetches a page in real time when a ChatGPT user (or a GPT/agent) follows a link or browses on demand.

User-agent token: ChatGPT-User

Model training

ClaudeBot

by Anthropic

Anthropic's training crawler for Claude. Allowing it lets your content inform future Claude models.

User-agent token: ClaudeBot

User-triggered fetch

Claude-Web / anthropic-ai

by Anthropic

Anthropic's user-facing and legacy tokens. Add these alongside ClaudeBot for complete Anthropic coverage.

User-agent tokens: Claude-Web, anthropic-ai

Search & answers

PerplexityBot

by Perplexity

Indexes pages so they can be cited in Perplexity's AI answers. A key source of AI referral traffic.

User-agent token: PerplexityBot

Training opt-out token

Google-Extended

by Google

A robots.txt token — not a separate crawler. Controls AI training use without affecting Google Search.

User-agent token: Google-Extended

Model training

CCBot

by Common Crawl

Common Crawl's bot. Its open dataset is one of the most widely used training corpora across the AI industry.

User-agent token: CCBot

Model training

Bytespider

by ByteDance

ByteDance's aggressive training crawler. Frequently blocked because of its heavy crawl volume and concerns about whether it honors robots.txt.

User-agent token: Bytespider

Search & answers

Amazonbot

by Amazon

Amazon's crawler, used to answer questions through Alexa and to support Amazon's AI services.

User-agent token: Amazonbot

Training opt-out token

Applebot-Extended

by Apple

A robots.txt token that opts your content out of Apple's generative AI training, without blocking Siri/Spotlight.

User-agent token: Applebot-Extended

Model training

Meta-ExternalAgent

by Meta

Meta's crawler for training Meta AI and the Llama models. Honors robots.txt rules for its token.

User-agent token: meta-externalagent

Search & answers

DuckAssistBot

by DuckDuckGo

Powers DuckAssist, DuckDuckGo's AI answer feature. Privacy-focused and robots.txt compliant.

User-agent token: DuckAssistBot

Model training

cohere-ai

by Cohere

Cohere's crawler, gathering content to train its enterprise-focused language models.

User-agent token: cohere-ai

Reference

How to allow or block AI crawlers

AI crawlers are controlled the same way as any other bot: with robots.txt rules that target each user-agent token. Add one group per bot you want to control, then a final catch-all group (User-agent: *) if you need a default for every other bot. Most reputable AI bots honor these rules; a few (noted on their pages) do not, and need server- or firewall-level blocking.
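For bots that do not honor robots.txt, the block has to happen where requests are served. The following is a minimal sketch in Python, assuming a WSGI application and matching on the User-Agent request header. The names `BLOCKED_TOKENS`, `is_blocked`, and `block_crawlers` are hypothetical, and the token in the list is only a placeholder, not a statement about which bots ignore robots.txt. User-Agent headers can be spoofed, so treat this as a first filter rather than a guarantee.

```python
# Hypothetical blocklist: replace with the tokens you need to block server-side.
BLOCKED_TOKENS = ("Bytespider",)


def is_blocked(user_agent, blocked_tokens=BLOCKED_TOKENS):
    """Return True if the User-Agent header contains a blocked token (case-insensitive)."""
    ua = (user_agent or "").lower()
    return any(token.lower() in ua for token in blocked_tokens)


def block_crawlers(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app so requests from blocked user agents get a 403 response."""

    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT"), blocked_tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else is passed through to the wrapped application untouched.
        return app(environ, start_response)

    return middleware
```

Wrapping an existing application would look like `application = block_crawlers(application)`. The same substring match could instead live in a web server or firewall rule; the Python version is shown only because it is easy to test.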

To allow a specific crawler:

User-agent: GPTBot
Allow: /

To block a specific crawler:

User-agent: Bytespider
Disallow: /
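One way to sanity-check a rule before deploying it is to parse it with Python's standard-library `urllib.robotparser`. The sketch below assumes the Bytespider rule above is the entire file; the helper name `is_allowed` and the example.com URL are hypothetical. It shows how Python's parser reads the rule, not how any particular crawler behaves in practice.

```python
from urllib.robotparser import RobotFileParser

# The blocking rule from above, treated here as the whole robots.txt file.
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /
"""


def is_allowed(robots_txt, token, url="https://example.com/"):
    """Return True if the given user-agent token may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


print(is_allowed(ROBOTS_TXT, "Bytespider"))  # False: matched by the Disallow group
print(is_allowed(ROBOTS_TXT, "GPTBot"))      # True: no group applies, so allowed by default
```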

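A common policy: opt out of AI training while staying visible in AI search and in traditional search engines. The example covers three training tokens; add a Disallow group for any other training crawler in the directory (such as ClaudeBot or Bytespider) that you also want to exclude. Crawlers with no matching group, including traditional search engine bots, remain allowed by default: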

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
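As the list of tokens grows, a policy like this can be generated from a single mapping instead of being edited by hand. The following is a minimal sketch under that assumption: `POLICY`, `render_robots_txt`, and `allowed_tokens` are hypothetical names, the decisions mirror the example above, and the check uses Python's standard-library parser rather than any crawler's own logic.

```python
from urllib.robotparser import RobotFileParser

# Token -> True (allow) or False (disallow), mirroring the policy above.
POLICY = {
    "GPTBot": False,
    "CCBot": False,
    "Google-Extended": False,
    "OAI-SearchBot": True,
    "PerplexityBot": True,
}


def render_robots_txt(policy):
    """Render one robots.txt group per token, separated by blank lines."""
    groups = []
    for token, allowed in policy.items():
        directive = "Allow" if allowed else "Disallow"
        groups.append(f"User-agent: {token}\n{directive}: /")
    return "\n\n".join(groups) + "\n"


def allowed_tokens(robots_txt, tokens, url="https://example.com/"):
    """Return the tokens that Python's parser would let fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token in tokens if parser.can_fetch(token, url)]


robots_txt = render_robots_txt(POLICY)
print(robots_txt)

# Googlebot has no group of its own, so it stays allowed by default.
print(allowed_tokens(robots_txt, list(POLICY) + ["Googlebot"]))
# ['OAI-SearchBot', 'PerplexityBot', 'Googlebot']
```

Keeping the decisions in one mapping makes it harder to forget a token when the policy changes, and the final check confirms that the search crawlers are still allowed after regeneration.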

Note: Google-Extended and Applebot-Extended are training opt-out tokens — disallowing them does not remove you from Google Search, Siri, or Spotlight.

Not sure which bots can actually read your page?

Run any URL through the AI Crawl Checker to see exactly what these crawlers receive — content, status codes, structured data, and robots.txt rules — in seconds.
