The AI Crawler Directory
A plain-English reference for the bots that read the web on behalf of AI systems. For each crawler you'll find who runs it, what it's used for, its exact user-agent token, and how to allow or block it in robots.txt.
Every crawler we test against
14 AI crawlers and control tokens. Select one for a full reference, including the robots.txt rules to allow or block it.
GPTBot
by OpenAI
OpenAI's training crawler. Allowing it lets your content be used to train future GPT models.
Token: GPTBot
OAI-SearchBot
by OpenAI
OpenAI's search crawler. Controls whether your pages can surface and be cited in ChatGPT Search.
Token: OAI-SearchBot
ChatGPT-User
by OpenAI
Fetches a page in real time when a ChatGPT user (or a GPT/agent) follows a link or browses on demand.
Token: ChatGPT-User
ClaudeBot
by Anthropic
Anthropic's training crawler for Claude. Allowing it lets your content inform future Claude models.
Token: ClaudeBot
Claude-Web / anthropic-ai
by Anthropic
Anthropic's user-facing and legacy tokens. Add these alongside ClaudeBot for complete Anthropic coverage.
Token: Claude-Web
PerplexityBot
by Perplexity
Indexes pages so they can be cited in Perplexity's AI answers. A key source of AI referral traffic.
Token: PerplexityBot
Google-Extended
by Google
A robots.txt token — not a separate crawler. Controls AI training use without affecting Google Search.
Token: Google-Extended
CCBot
by Common Crawl
Common Crawl's bot. Its open dataset is one of the most widely used training corpora across the AI industry.
Token: CCBot
Bytespider
by ByteDance
ByteDance's aggressive training crawler. Frequently blocked due to heavy crawl volume and concerns about whether it honors robots.txt.
Token: Bytespider
Amazonbot
by Amazon
Amazon's crawler, used to answer questions through Alexa and to support Amazon's AI services.
Token: Amazonbot
Applebot-Extended
by Apple
A robots.txt token that opts your content out of Apple's generative AI training, without blocking Siri/Spotlight.
Token: Applebot-Extended
Meta-ExternalAgent
by Meta
Meta's crawler for training Meta AI and the Llama models. Honors robots.txt rules for its token.
Token: meta-externalagent
DuckAssistBot
by DuckDuckGo
Powers DuckAssist, DuckDuckGo's AI answer feature. Privacy-focused and robots.txt compliant.
Token: DuckAssistBot
cohere-ai
by Cohere
Cohere's crawler, gathering content to train its enterprise-focused language models.
Token: cohere-ai
Reference
How to allow or block AI crawlers
AI crawlers are controlled the same way as any other bot: with robots.txt rules targeting each user-agent token. Add a block per bot you want to control, then a final catch-all if needed. Most reputable AI bots honor these rules — a few (noted on their pages) do not, and need server- or firewall-level blocking.
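If you manage rules for many bots, the one-block-per-bot structure can be generated rather than written by hand. The Python sketch below is one way to do that. The function name build_robots_txt and its policy and catch_all arguments are hypothetical, chosen for illustration, and it only handles whole-site allow or block rules.

```python
def build_robots_txt(policy, catch_all=None):
    """Render one robots.txt group per user-agent token.

    policy: dict mapping a token to True (allow) or False (block).
    catch_all: optional bool; if given, a final "User-agent: *" group is added.
    """
    groups = []
    for token, is_allowed in policy.items():
        rule = "Allow: /" if is_allowed else "Disallow: /"
        groups.append(f"User-agent: {token}\n{rule}")
    if catch_all is not None:
        rule = "Allow: /" if catch_all else "Disallow: /"
        groups.append(f"User-agent: *\n{rule}")
    # Groups are separated by a blank line, matching common robots.txt layout.
    return "\n\n".join(groups) + "\n"


# Hypothetical policy: block training, allow search, allow everyone else.
example = build_robots_txt({"GPTBot": False, "OAI-SearchBot": True}, catch_all=True)
print(example)
```

The catch-all group goes last here only for readability. Crawlers that follow the standard pick the group that names them specifically and fall back to the * group otherwise.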
To allow a specific crawler:
User-agent: GPTBot
Allow: /
To block a specific crawler:
User-agent: Bytespider
Disallow: /
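robots.txt is advisory, so a bot that ignores it has to be refused by the server itself. As a minimal sketch, assuming your site runs as a Python WSGI application, the middleware below returns 403 to any request whose User-Agent header contains a blocked token. BLOCKED_TOKENS, block_ai_crawlers, demo_app, and status_for are hypothetical names, and the sample user-agent strings are illustrative rather than real crawler strings.

```python
# Hypothetical blocklist; extend it with any tokens you want refused.
BLOCKED_TOKENS = ("Bytespider",)


def block_ai_crawlers(app, blocked=BLOCKED_TOKENS):
    """Wrap a WSGI app so requests from blocked user agents get a 403."""
    lowered = tuple(token.lower() for token in blocked)

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        # Case-insensitive substring match against the User-Agent header.
        if any(token in user_agent for token in lowered):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)

    return middleware


def demo_app(environ, start_response):
    """Stand-in for your real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(user_agent):
    """Send one fake request through the wrapped app and return its status line."""
    seen = []
    wrapped = block_ai_crawlers(demo_app)
    wrapped({"HTTP_USER_AGENT": user_agent}, lambda status, headers: seen.append(status))
    return seen[0]


print(status_for("Mozilla/5.0 (compatible; Bytespider)"))  # illustrative UA string
print(status_for("Mozilla/5.0 Firefox/126.0"))             # illustrative UA string
```

User-agent matching only stops bots that identify themselves honestly. A crawler that spoofs its header needs IP- or firewall-level rules instead.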
A common policy — opt out of AI training but stay visible in AI search and to traditional search engines:
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
Note: Google-Extended and Applebot-Extended are training opt-out tokens — disallowing them does not remove you from Google Search, Siri, or Spotlight.
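Before deploying a policy like the one above, it can help to check locally which tokens it blocks. The sketch below uses Python's standard-library urllib.robotparser. The POLICY string repeats the common policy, while the allowed helper and the example.com URL are placeholders for illustration. Python's parser applies its own matching rules, so treat this as a sanity check rather than a guarantee of how each crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# The "opt out of training, stay visible in search" policy from above.
POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(POLICY.splitlines())


def allowed(token, url="https://example.com/article"):
    """Return True if the parsed policy lets this token fetch the URL."""
    return parser.can_fetch(token, url)


# Googlebot is included to show that it has no matching group and stays allowed.
for token in ("GPTBot", "CCBot", "Google-Extended",
              "OAI-SearchBot", "PerplexityBot", "Googlebot"):
    print(token, "allowed" if allowed(token) else "blocked")
```

With this policy the three training tokens report as blocked and the two search crawlers as allowed. Googlebot is also allowed, because no group names it and there is no catch-all.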
Not sure which bots can actually read your page?
Run any URL through the AI Crawl Checker to see exactly what these crawlers receive — content, status codes, structured data, and robots.txt rules — in seconds.