
Here’s Proof You Can Train an AI Model Without Slurping Copyrighted Content

Mar 21, 2024 - wired.com
In 2023, OpenAI stated that it was impossible to train leading AI models without using copyrighted materials, a claim that recent developments now challenge. A group of French government-backed researchers has released a large AI training dataset composed entirely of public domain text, and the nonprofit Fairly Trained has awarded its first certification for a large language model built without copyright infringement. The model, KL3M, was developed by legal tech consultancy startup 273 Ventures using a curated training dataset of legal, financial, and regulatory documents.

The public domain dataset, called Common Corpus, is the largest available AI dataset for language models composed purely of public domain content. It contains 500 billion tokens, was built from sources such as public domain newspapers, and is posted on the open source AI platform Hugging Face. Despite these advances, critics note that public domain data is often antiquated, which limits its usefulness for training AI models on current affairs or contemporary language use.

Key takeaways:

  • OpenAI previously stated that it was impossible to train leading AI models without using copyrighted materials, but recent developments suggest otherwise.
  • A group of researchers backed by the French government has released the largest AI training dataset composed entirely of public domain text, and Fairly Trained has awarded its first certification for a large language model built without copyright infringement.
  • 273 Ventures has developed KL3M, a large language model trained on a curated dataset of legal, financial, and regulatory documents, demonstrating that AI can be built differently from the industry norm.
  • Common Corpus, a freely available dataset composed purely of public domain content, has been released, offering researchers and startups a vetted training set free from concerns over potential infringement.
