CCBot

Common Crawl's bot. Its open dataset is one of the most widely used training corpora across the AI industry.

Operator: Common Crawl
Powers: The open Common Crawl dataset used by many LLMs
Purpose: Model training
User-agent token: CCBot
Respects robots.txt: Yes

CCBot builds Common Crawl, a free and openly available snapshot of the web. Because the dataset is public, a large number of AI labs reuse it to train models, which makes CCBot one of the highest-leverage bots to set a policy for.

Blocking CCBot reduces your exposure across many downstream models at once, since so many training pipelines start from Common Crawl. Because CCBot respects robots.txt, a single robots.txt rule is enough to allow or block it.

Allow CCBot

Maximizes the reach of your content across the many models trained on Common Crawl.

User-agent: CCBot
Allow: /

Block CCBot

A single rule that reduces your footprint in numerous AI training sets that derive from Common Crawl.

User-agent: CCBot
Disallow: /
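
To sanity-check either rule set before deploying it, the snippets above can be run through a robots.txt parser. The following is a minimal sketch using Python's standard-library urllib.robotparser. The helper name may_fetch, the example.com URL, and the agent name SomeOtherBot are placeholders chosen for illustration. Python's parser is one interpretation of robots.txt and is not necessarily identical to the one CCBot uses.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets shown above, as they would appear in robots.txt.
ALLOW_RULES = """\
User-agent: CCBot
Allow: /
"""

BLOCK_RULES = """\
User-agent: CCBot
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper for illustration; it parses the text locally
    and makes no network requests.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/post.html"  # placeholder URL
    print("allow rules, CCBot:", may_fetch(ALLOW_RULES, "CCBot", url))
    print("block rules, CCBot:", may_fetch(BLOCK_RULES, "CCBot", url))
    # Agents with no matching group are unaffected by the CCBot rule.
    print("block rules, other:", may_fetch(BLOCK_RULES, "SomeOtherBot", url))
```

Because the block rule names only CCBot, agents that match no group in the file remain allowed by default in this parser.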
