
OpenAI’s GPTBot gets blocked the most, but it’s not the hungriest AI crawler on the web

Summary

An analysis by Cloudflare shows that Bytespider, Amazonbot, and ClaudeBot are among the most active AI crawlers on the web.

Over the past year, Cloudflare has analyzed which AI crawlers with known user agent strings have the highest request volume. Bytedance’s Bytespider crawler tops the list of most active AI web crawlers, followed by Amazonbot, ClaudeBot, and OpenAI’s GPTBot.

According to Cloudflare, Bytespider likely collects training data for Doubao, Bytedance’s Chinese ChatGPT competitor, while Amazonbot primarily indexes content used for Alexa’s answers. ClaudeBot collects training data for Anthropic’s Claude models.

Bytedance and Anthropic currently seem to be crawling particularly hard for AI training data. | Image: Cloudflare

OpenAI’s GPTBot, which collects training data for products such as ChatGPT, ranks second both in how often it is blocked and in the number of websites it crawls. Bytespider tops both rankings.

Image: Cloudflare

Cloudflare’s analysis also suggests that many website operators are unaware of the extent of AI crawler activity. According to Cloudflare, AI bots accessed approximately 39% of the top one million domains using Cloudflare in June, while only 2.98% of those sites blocked or filtered such requests.

Higher-ranked sites are more likely to be targeted by AI bots and are more likely to block them, as their content is often part of their core business, and they have the resources to implement technical measures.

Image: Cloudflare

OpenAI’s GPTBot ranks second in the number of websites crawled, indicating that OpenAI continues to collect a substantial amount of data even though it sends fewer requests than Bytedance’s and Anthropic’s bots. This could be due to more efficient or selective data collection by GPTBot, as well as the large dataset OpenAI already holds from earlier crawls.

OpenAI CEO Sam Altman recently stated that the focus is now on learning more from high-quality data rather than accumulating more of it. Altman also said that his company has enough material to train the next generation of AI models.

GPTBot is the AI bot most often explicitly blocked via robots.txt. This is likely because OpenAI communicates this option transparently and because ChatGPT is the best-known AI platform and the one most controversially discussed with regard to privacy and copyright.

OpenAI’s AI crawlers are by far the most specifically blocked using robots.txt. | Image: Cloudflare
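
To illustrate how such a robots.txt block works in practice, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `allowed` helper, and the example.com URL are assumptions made for illustration; they are not taken from Cloudflare's analysis or from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot site-wide, leaves other crawlers alone.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("GPTBot", "ClaudeBot"):
        print(bot, allowed(bot, "https://example.com/article"))
```

Note that robots.txt is only a request: it takes effect if the crawler chooses to honor it, which is why operators may also want blocking at the network level.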

Cloudflare has also observed that AI bots are increasingly disguising themselves as regular browsers to gain access to content, typically by changing their user agent string. Perplexity was recently criticized for this practice.

However, according to the company’s analysis, Cloudflare’s global machine learning models reliably detect such crawlers based on patterns, without the need for manual training.
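
A short Python sketch shows why a changed user agent string defeats simple filtering. It matches the crawler names mentioned in this article against a request's User-Agent header; the `match_ai_crawler` helper and both sample strings are illustrative assumptions, and this is not how Cloudflare's models work.

```python
from typing import Optional

# Crawler names taken from the article; the matching logic is a naive assumption.
AI_CRAWLER_TOKENS = ("Bytespider", "Amazonbot", "ClaudeBot", "GPTBot")


def match_ai_crawler(user_agent: str) -> Optional[str]:
    """Return the first known crawler token found in user_agent, else None."""
    ua = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in ua:
            return token
    return None


# Illustrative header values, not captured traffic.
declared = "ExampleFetcher/1.0 (compatible; GPTBot/1.0)"
disguised = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
)

if __name__ == "__main__":
    print(match_ai_crawler(declared))   # the crawler identifies itself
    print(match_ai_crawler(disguised))  # a browser-like string is not matched
```

A crawler that sends a browser-like string passes this kind of check untouched, so detection has to rely on signals other than the header itself.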

To support website operators, Cloudflare has introduced a new feature for all customers that allows all AI bots to be blocked with one click in the dashboard. The feature will be continuously expanded with new fingerprints as Cloudflare identifies more crawlers.

Cloudflare also offers a reporting tool that can be used to report AI crawlers to the company so that they can be analyzed and automatically blocked in the future.
