Natural language processing

Study reveals rapid increase in web domains blocking AI models from training data

Study reveals rapid increase in web domains blocking AI models from training data



summary
Summary

A new study reveals that AI models are increasingly losing access to their web-based training data. This growing trend of restrictions could force models to learn from less, more biased, and outdated information in the future.

The Data Provenance Initiative, an independent academic group, conducted a large-scale study documenting a rapid decrease in web data access for AI models. Researchers analyzed robots.txt files and terms of use for 14,000 web domains that serve as sources for popular AI training datasets like C4, RefinedWeb, and Dolma.

From April 2023 to April 2024, the percentage of tokens in these datasets completely blocked for AI crawlers rose from about 1% to 5-7%. Tokens are the individual sentence and word components used to train AI models.

The increase was even more significant for key data sources, where the proportion of blocked tokens jumped from less than 3% to 20-33%. Researchers predict this trend will continue in the coming months. OpenAI faces the most frequent blocks, followed by Anthropic and Google.

Ad

Study reveals rapid increase in web domains blocking AI models from training data
The visualization provided by the Data Provenance Initiative shows that many content providers have switched to blocking access to their content for AI companies in the second half of 2023, either through robots.txt files, clauses in website terms of use, or both. | Image: Data Provenance Initiative

News websites, forums, and social media platforms are the main sources imposing restrictions. On news sites, the share of completely blocked tokens surged from 3% to 45% within a year.

As a result, their representation in the training data is likely to decline in favor of corporate and e-commerce sites, which have fewer restrictions but often lower quality content. This trend could particularly affect AI developers, as the industry has realized that learning from high-quality data produces better models.

The study also highlights a disparity between the actual use of generative AI models and the content of their training data. This could be relevant in legal cases where publishers sue AI companies, claiming that services like ChatGPT compete with their information offerings based on the publishers’ content.

Study reveals rapid increase in web domains blocking AI models from training data
The most common web services do not correspond to actual ChatGPT use cases, according to the researchers. Left: Share of tokens per web service and their monetization through paywalls/advertising. Right: Share of different user requests in WildChat, a dataset of ChatGPT interactions. | Image: Data Provenance Initiative

Overall, this development could make it more difficult, or at least pricier, to train powerful and reliable AI systems. High-quality content providers could potentially find new revenue streams and become major beneficiaries. But OpenAI and Meta CEO Mark Zuckerberg both also said that licensing all the data they need to train a good AI model would be impossible or unaffordable.

For example, OpenAI has recently negotiated several multi-million dollar deals with publishers to access their content for real-time display in chat systems and AI training. Other companies are likely to follow suit, unless a fair use ruling dramatically changes the situation.

Recommendation

Study reveals rapid increase in web domains blocking AI models from training data

Source link