Data Scraping by AI Startup Anthropic Causes Unrest Among Web Publishers

  • Anthropic accused of violating website publishers' terms of use through excessive scraping.
  • Web publishers demand regulated and consensual use of their content for AI training.

Eulerpool News

The emerging AI startup Anthropic has come under criticism for relentlessly collecting data from websites to train its systems, potentially violating publishers' terms of service. The allegations come from several affected companies.

To develop large language models, the technology behind chatbots such as OpenAI's ChatGPT and Anthropic's counterpart Claude, AI developers rely on vast amounts of data from a variety of sources. Anthropic, founded by former OpenAI researchers, aims to develop 'responsible' AI systems.

Among the critics is Matt Barrie, CEO of Freelancer.com, who described the San Francisco-based company as the 'most aggressive scraper' of his platform, which records millions of daily visits. According to Barrie, a web 'crawler' linked to Anthropic generated 3.5 million visits to his website within four hours, five times as many as the next most active AI crawler. Attempts to deny access using standardized protocols were unsuccessful, prompting Barrie to block all IP addresses associated with Anthropic.

Other website operators also reported increased access by Anthropic crawlers. Kyle Wiens, CEO of iFixit.com, reported one million requests within 24 hours, which triggered all of the site's overload alarms. iFixit explicitly prohibits the use of its data for machine learning in its terms of service.

One way to control web robots is the 'robots.txt' protocol, which, however, relies on voluntary compliance. Anthropic emphasized that its crawlers respect these signals once they are implemented and that it strives to minimize disruption. The company also said it takes into account technologies such as CAPTCHAs that protect against abuse.

Data scraping is not a new issue, but it has intensified sharply with the race to build advanced AI models, and it is creating additional costs for website operators. Eric Holscher, co-founder of the documentation platform Read the Docs, described the resulting bandwidth costs and the time spent combating abuse as significant.

Although Anthropic has positioned itself as an ethical player, it apparently has no partnerships comparable to those of OpenAI, which recently struck agreements with Reddit, The Atlantic, and the Financial Times to use their data legally. Web publishers are calling for closer scrutiny of data scraping practices so that their content is used with consent and the long-term benefits of AI development are secured.
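
For readers unfamiliar with the 'robots.txt' protocol mentioned above, the following is a minimal sketch of how a well-behaved crawler might check a site's rules before fetching a page. It uses Python's standard `urllib.robotparser` module. The crawler name `ExampleAIBot`, the rules, and the URL are invented for illustration and are not taken from any of the sites or companies named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. "ExampleAIBot" is an invented crawler
# name used only for this illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/guides/1"
    # The crawler runs this check on itself; nothing on the server enforces it.
    print("ExampleAIBot allowed:", may_fetch("ExampleAIBot", page))
    print("SomeOtherBot allowed:", may_fetch("SomeOtherBot", page))
```

The check runs on the crawler's side, which is why the protocol depends on voluntary compliance: a crawler that skips it can still request the page. Measures such as the IP blocking Barrie describes are enforced by the server instead.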