AI start-up Anthropic accused of aggressively scraping data from websites

Web publishers complain that the developer collects content for AI training without authorization and ignores requests to stop.

Eulerpool News Jul 28, 2024, 1:12 PM

The AI startup Anthropic is accused of aggressively scraping data from websites to train its systems, possibly violating publisher terms of use, according to affected parties.

AI developers rely on large amounts of data from various sources to create large language models that form the technology behind chatbots like OpenAI's ChatGPT and Anthropic's competitor Claude.

Anthropic was founded by former OpenAI researchers and promises to develop "responsible" AI systems. Nevertheless, Matt Barrie, CEO of Freelancer.com, accuses the San Francisco-based company of being "by far the most aggressive scraper" of his freelancer platform, which garners millions of daily visits.

Other web publishers share Barrie's concerns that Anthropic is overwhelming their sites and ignoring their instructions to stop collecting content. According to Barrie, Freelancer.com received 3.5 million visits from an Anthropic-linked web crawler within four hours. "That's probably about five times as much as the number two," said Barrie.

Visits by this bot continued to increase, even after Freelancer.com attempted to deny access using standard protocols. Barrie then decided to block all traffic from Anthropic's IP addresses. "We had to block them because they don't abide by the rules of the internet," said Barrie. "This blatant scraping slows the site down for all users and ultimately affects our revenue."

Anthropic said it is investigating the matter, that it respects publishers' requests, and that it aims not to be "intrusive or disruptive."

Scraping publicly accessible data is generally legal, but it may violate website terms of use and can be costly for site operators. Kyle Wiens, CEO of iFixit.com, said that his electronics repair site received a million hits from Anthropic's bots within 24 hours. "We have many high traffic alarms that wake people up at 3 AM. This set off all our alarms," he said.

iFixit’s terms of use prohibit the use of its data for machine learning. “My first message to Anthropic is: Using this to train your model is illegal. My second message is: This is not polite internet behavior. Crawling is a matter of etiquette.”

Websites use the "robots.txt" protocol to keep crawlers and other web robots away from certain areas of their pages, but this relies on voluntary compliance. Anthropic said its crawlers respect "anti-circumvention technologies" like CAPTCHAs and that "our crawling should not be intrusive or disruptive."
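
To illustrate what "voluntary compliance" means in practice, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the user-agent token `ExampleAIBot`, and the example.com URLs are hypothetical and not taken from any site named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve. The token "ExampleAIBot"
# and the paths are illustrative assumptions, not real crawler names.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The named bot is shut out of the whole site; other bots only of /private/.
    print(is_allowed("ExampleAIBot", "https://example.com/articles/1"))
    print(is_allowed("OtherBot", "https://example.com/articles/1"))
    print(is_allowed("OtherBot", "https://example.com/private/report"))
```

Nothing in this check is enforced by the server: a crawler that never runs it, or ignores the result, can still request the page. That is why operators such as Freelancer.com fall back on blocking IP addresses.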

Web scraping has dramatically increased over the past two years due to the AI arms race, causing new costs for website operators. "AI crawlers have caused us significant bandwidth costs and taken up a lot of time dealing with abuse," wrote Eric Holscher, co-founder of the documentation hosting site Read the Docs, in a blog post.

Anthropic has created some of the world's most advanced chatbots, rivaling OpenAI's ChatGPT, and positions itself as an ethical player. Anthropic's stated goal is the "responsible development and maintenance of advanced AI for the long-term benefit of humanity."

As leading AI companies develop increasingly powerful models, they are delving deeper into unexplored corners of the internet, collaborating with publishers, or creating synthetic training data. OpenAI has made several deals with publishers and content providers like Reddit, The Atlantic, and the Financial Times in recent months. Anthropic has not publicly announced similar partnerships.

"Search engines have always done a lot of scraping," said Barrie, "but with the training of generative AI, it has risen to a whole new level."

iFixit's mission is to share information to encourage people to repair things themselves. "We don't mind if they use our content for model training, we just want to be part of the conversation," said Wiens. "I'm not a crusader on this issue, I'm just trying to keep a website online."
