
Why Human-Created Input Data Is Needed to Maintain AI Models

Dec 04, 2024 - natlawreview.com
The article examines a tendency of generative AI models to select highly probable choices from the middle of the bell curve of naturally occurring training data. It illustrates this with a dataset of dog breeds: the AI over-represents the most common breed, the golden retriever, and over successive generations, as the AI's outputs are fed back in as new training data, the golden retriever becomes the only breed represented. The model's outputs then lose coherence, degenerating into nonsensical abstract blobs.
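The feedback loop described above can be sketched as a toy simulation. This is an illustrative sketch, not the article's own experiment: the breed counts, the `mode_bias` exponent, and the `next_generation` helper are all assumptions chosen to mimic a model that over-samples the peak of its training distribution.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the toy run is repeatable

# Hypothetical starting distribution of dog breeds in the training data.
breeds = {"golden retriever": 40, "labrador": 25, "poodle": 20,
          "beagle": 10, "basenji": 5}

def next_generation(counts, n_samples=100, mode_bias=1.5):
    """Sample a new 'training set' from the model's outputs.

    mode_bias > 1 exaggerates the peak of the distribution,
    mimicking how a generative model favors highly probable
    choices from the middle of the bell curve.
    """
    population = list(counts)
    weights = [c ** mode_bias for c in counts.values()]
    return Counter(random.choices(population, weights, k=n_samples))

data = Counter(breeds)
for generation in range(10):
    data = next_generation(data)

# After a few generations the rare breeds vanish and the most
# common one dominates -- a toy version of model collapse.
print(data.most_common())
```

Each generation trains only on the previous generation's outputs, so the slight over-sampling of the mode compounds until the rare breeds disappear entirely.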

The article also warns of the dangers of reusing generative AI outputs as training data, since doing so can cause the model to "forget" its original training data. The authors suggest that this issue could be mitigated by maintaining a mix of synthetic and human-created training data. The article concludes by noting a proposed California law, AB3211, supported by OpenAI, which would require watermarking of AI-generated content to distinguish it from other data, aiding in tracking and excluding such content during data collection.
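The suggested mitigation can be sketched the same way. This is a minimal illustration under assumed parameters (the `human_fraction` share, breed counts, and `mode_bias` exponent are all hypothetical): each retraining round draws a fixed fraction of fresh human-created samples alongside the model's own biased outputs, so rare categories keep re-entering the mix.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the toy run is repeatable

# Hypothetical ground-truth distribution of human-created data.
human_data = Counter({"golden retriever": 40, "labrador": 25, "poodle": 20,
                      "beagle": 10, "basenji": 5})

def model_outputs(counts, n=100, mode_bias=1.5):
    # The model exaggerates the peak of its training distribution.
    weights = [c ** mode_bias for c in counts.values()]
    return Counter(random.choices(list(counts), weights, k=n))

def retrain(counts, human_fraction=0.5, n=100):
    # Mix a fixed share of fresh human-created samples into each
    # new training set instead of training on model outputs alone.
    n_human = int(n * human_fraction)
    human = Counter(random.choices(list(human_data),
                                   list(human_data.values()), k=n_human))
    synthetic = model_outputs(counts, n=n - n_human)
    return human + synthetic

data = Counter(human_data)
for generation in range(10):
    data = retrain(data)

# The human share continually reintroduces rare breeds, so the
# distribution stays broad instead of collapsing to a single breed.
print(data.most_common())
```

Compared with training purely on synthetic outputs, the human-created fraction acts as an anchor that prevents the distribution from narrowing to its mode.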

Key takeaways:

  • Generative AI tends to select highly probable choices from the middle of the bell curve, which can lead to over-representation of certain data over time.
  • Over multiple generations of feeding generative AI outputs back in as new training data, the model can start to "forget" the characteristics of its original training data, leading to increasingly nonsensical outputs.
  • There is a need to distinguish data generated by large language models (LLMs) from other data, raising questions about the provenance of content crawled from the Internet.
  • OpenAI has expressed support for a proposed California law that would require watermarking on AI-generated content, which would help track and exclude LLM-generated content at scale when collecting new training data.
