
GPTBot: OpenAI launches a web crawler to improve its large language models

Aug 08, 2023 - aibeat.co
OpenAI has launched a web crawler, GPTBot, to enhance its future models, possibly including GPT-4 and GPT-5, according to a company blog post. The crawler is designed to exclude sources that require paywall access, sources known for collecting personally identifiable information, and sources containing text that breaches OpenAI's policies. The company has also provided guidelines for website owners who want to prevent GPTBot from crawling their websites or restrict its access to certain areas.
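As a rough illustration of how such a restriction works, the sketch below checks a robots.txt file against the `GPTBot` user-agent token using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.com` URLs, and the helper name `is_allowed` are assumptions made for this example, not text taken from OpenAI's guidelines.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep GPTBot out of /private/ only.
PARTIAL_BLOCK = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Disallow:
"""

# Hypothetical robots.txt: keep GPTBot out of the whole site.
FULL_BLOCK = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/blog/post",
                "https://example.com/private/report.html"):
        print(url, is_allowed(PARTIAL_BLOCK, "GPTBot", url))
```

A site owner could run a check like this against their own robots.txt before deploying it. Note that robots.txt is advisory: it only has an effect if the crawler chooses to honor it.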

The company has been criticized for scraping paywalled content from various publications and using it to train its large language models without the publishers' consent. While publishers could have prevented this by modifying their robots.txt files, many were unaware that their data was being used in this way. OpenAI's reasons for launching this crawler with controls for site owners are unclear; they could include a desire to improve its practices, pressure from stakeholders, or other factors.

Key takeaways:

  • OpenAI has launched a web crawler, GPTBot, that could improve future models, including GPT-4 and GPT-5.
  • The new web crawler is configured to filter out sources that require paywall access, collect personally identifiable information, or contain text that violates OpenAI’s policies.
  • OpenAI has provided instructions for website owners who want to prevent GPTBot from crawling their websites or restrict its access to specific areas.
  • OpenAI has been scraping the internet for years to train its large language models, often without the consent of publishers or website owners.