
Multiple AI Companies Ignore Robots.Txt Files, Scrape Web Content, Says Licensing Firm - Slashdot

Jun 24, 2024 - news.slashdot.org
AI companies are reportedly ignoring robots.txt files, which publishers use to block the scraping of their web content for generative AI systems, according to a warning from content licensing startup TollBit. The startup, which acts as an intermediary between AI companies seeking content and publishers willing to license it, has found that numerous AI agents are bypassing the protocol, allowing them to retrieve content from sites without permission.
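
For readers unfamiliar with the protocol, the following is a minimal sketch of the check a well-behaved crawler is expected to run before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the `ExampleAIBot` user agent, the `may_fetch` helper, and the example.com URL are hypothetical and chosen only for illustration. Compliance is voluntary: nothing in the protocol stops a crawler that skips this check, which is the behavior TollBit describes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: bar one AI crawler, allow every other agent.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch("ExampleAIBot", "https://example.com/news/story.html"))  # False
print(may_fetch("SomeOtherBot", "https://example.com/news/story.html"))  # True
```

A crawler that honors the file would skip any URL for which the check returns False; one that ignores it simply never asks.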

The president of the News Media Alliance, a trade group representing over 2,200 U.S.-based publishers, expressed concern that without the ability to opt out of massive scraping, publishers cannot monetize their content and pay journalists, which could seriously harm the industry. Publishers are also worried about AI-generated news summaries, particularly since Google launched a product that uses AI to create summaries in response to search queries. To keep their content out of those summaries, publishers would have to use the same tool that removes them from Google search results, making them virtually invisible on the web.

Key takeaways:

  • Several AI companies are reportedly ignoring robots.txt files, which publishers use to block the scraping of their web content for generative AI systems, according to a warning sent by content licensing startup TollBit.
  • TollBit, which acts as an intermediary between AI companies seeking content and publishers willing to license it, has found that numerous AI agents are bypassing the robots.txt protocol.
  • The president of the News Media Alliance, a trade group representing over 2,200 U.S.-based publishers, warns that without the ability to opt out of massive scraping, publishers cannot monetize their content and pay journalists, which could seriously harm the industry.
  • Publishers are also concerned about Google's AI product that creates news summaries from their content. To keep their content out of those summaries, they would have to use the same tool that prevents them from appearing in Google search results, making them virtually invisible on the web.