CCBot

The web crawler operated by Common Crawl, whose open dataset is one of the most widely used training corpora in the AI industry.

Operator	Common Crawl
Powers	The open Common Crawl dataset used by many LLMs
Purpose	Model training
User-agent token	`CCBot`
Respects robots.txt	Yes

CCBot is the crawler that builds Common Crawl, a free and openly available snapshot of the web. Because the dataset is public, many AI labs reuse it to train models, which makes your CCBot policy one of the highest-leverage crawler decisions you can make.

Blocking CCBot reduces your exposure across many downstream models at once, because so many training pipelines start from Common Crawl. CCBot respects robots.txt, so a rule there is enough to control it. Note that a robots.txt rule only affects future crawls.
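
Before deciding, it can help to see whether CCBot is visiting your site at all. The following is a minimal Python sketch that matches the `CCBot` user-agent token against User-Agent header values, for example ones pulled from an access log. The helper names `is_ccbot` and `count_ccbot_requests` are hypothetical, and the sample strings are illustrative only. User-Agent headers can be spoofed, so treat the result as an indication rather than proof.

```python
import re

# Assumption: CCBot identifies itself with the token "CCBot" somewhere in the
# User-Agent header. The token is matched as a whole word, case-insensitively.
_CCBOT_TOKEN = re.compile(r"\bCCBot\b", re.IGNORECASE)


def is_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent string contains the CCBot token."""
    return bool(_CCBOT_TOKEN.search(user_agent))


def count_ccbot_requests(user_agents) -> int:
    """Count how many User-Agent strings in an iterable look like CCBot."""
    return sum(1 for ua in user_agents if is_ccbot(ua))


if __name__ == "__main__":
    # Illustrative sample values, not real log data.
    sample = [
        "CCBot/2.0 (https://commoncrawl.org/faq/)",
        "Mozilla/5.0 (compatible; GPTBot/1.0)",
    ]
    print(count_ccbot_requests(sample))
```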

Allow CCBot

Maximizes the reach of your content across the many models trained on Common Crawl.

User-agent: CCBot
Allow: /

Block CCBot

A single rule that reduces your footprint in the many AI training sets derived from Common Crawl.

User-agent: CCBot
Disallow: /
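
Either rule set can be sanity-checked offline before you deploy it. The sketch below uses `urllib.robotparser` from Python's standard library to parse robots.txt text and ask whether the `CCBot` token may fetch a URL. The helper name `ccbot_allowed` and the `example.com` URL are assumptions for illustration, and the standard-library parser may not match CCBot's own rule handling in every edge case.

```python
from urllib.robotparser import RobotFileParser


def ccbot_allowed(robots_txt: str, url: str) -> bool:
    """Parse robots.txt text and report whether the CCBot token may fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


ALLOW_RULES = "User-agent: CCBot\nAllow: /"
BLOCK_RULES = "User-agent: CCBot\nDisallow: /"

if __name__ == "__main__":
    # example.com is a placeholder; substitute one of your own URLs.
    page = "https://example.com/some/page"
    print("allow rules:", ccbot_allowed(ALLOW_RULES, page))
    print("block rules:", ccbot_allowed(BLOCK_RULES, page))
```

With this parser, the allow rules report that CCBot may fetch the page and the block rules report that it may not. Crawlers that are not named in the file are unaffected by either rule set.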

Related crawlers

GPTBot

ClaudeBot

Bytespider