CCBot
Common Crawl's bot. Its open dataset is one of the most widely used training corpora across the AI industry.
| Operator | Common Crawl |
|---|---|
| Powers | The open Common Crawl dataset used by many LLMs |
| Purpose | Model training |
| User-agent token | CCBot |
| Respects robots.txt | Yes |
CCBot builds Common Crawl, a free and openly available snapshot of the web. Because the dataset is public, a large number of AI labs reuse it to train models, which makes CCBot one of the highest-leverage bots to decide on.
Blocking CCBot reduces your exposure across many downstream models at once, since so many training pipelines start from Common Crawl. CCBot respects robots.txt, so a rule addressed to the CCBot user-agent token is enough to allow or block it.
Allow CCBot
Maximizes the reach of your content across the many models trained on Common Crawl.
User-agent: CCBot
Allow: /
Block CCBot
A single rule that reduces your footprint in numerous AI training sets that derive from Common Crawl.
User-agent: CCBot
Disallow: /
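Before deploying either rule set, it can help to check offline how a robots.txt parser reads it. The sketch below uses Python's standard `urllib.robotparser` for that. The helper name `can_fetch`, the constants `ALLOW_RULES` and `BLOCK_RULES`, and the `example.com` URL are illustrative assumptions, and Python's parser is only a stand-in: it is not CCBot's own implementation, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets from this page, one directive per line.
ALLOW_RULES = """\
User-agent: CCBot
Allow: /
"""

BLOCK_RULES = """\
User-agent: CCBot
Disallow: /
"""


def can_fetch(robots_txt: str, url: str, agent: str = "CCBot") -> bool:
    """Return True if the robots.txt text lets `agent` fetch `url`.

    Hypothetical helper: it reflects how Python's parser interprets
    the rules, which may differ in edge cases from a real crawler.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/post.html"  # placeholder URL
    print("allow rules:", can_fetch(ALLOW_RULES, page))
    print("block rules:", can_fetch(BLOCK_RULES, page))
    # A rule addressed only to CCBot leaves other tokens unaffected.
    print("other bot, block rules:", can_fetch(BLOCK_RULES, page, agent="SomeOtherBot"))
```

Each directive must sit on its own line in the real robots.txt file. A file with `User-agent` and `Disallow` run together on one line will not be parsed as intended.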