The article also covers the dangers of feeding Gen AI outputs back in as training data, which can cause the AI model to "forget" the characteristics of its original training data. The authors suggest this could be mitigated by maintaining a mix of synthetic and human-created training data. The article concludes with a proposed California law, AB 3211, supported by OpenAI, which would require watermarking on AI-generated content to distinguish it from other data, making it easier to track and exclude such content when collecting new training data.
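To make the mechanism concrete, here is a minimal toy simulation, not from the article: a one-dimensional Gaussian stands in for a generative model, each generation is refit to samples drawn from the previous fit, and generation favors the middle of the bell curve. The `fit`, `generate`, and `run` helpers and all parameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
human_data = rng.normal(0.0, 1.0, size=10_000)  # the original "human" training data

def fit(data):
    # "Training" the toy model: estimate the Gaussian's parameters from data.
    return data.mean(), data.std()

def generate(mu, sigma, n):
    # Generation favors highly probable outputs: draw samples, then keep
    # only those within 1.5 sigma of the mean (the middle of the bell curve).
    samples = rng.normal(mu, sigma, size=4 * n)
    kept = samples[np.abs(samples - mu) < 1.5 * sigma]
    return kept[:n]

def run(generations, human_fraction):
    # Repeatedly refit the model to its own outputs, optionally mixed
    # with a fixed fraction of the original human data.
    mu, sigma = fit(human_data)
    for _ in range(generations):
        synthetic = generate(mu, sigma, 10_000)
        n_human = int(human_fraction * len(synthetic))
        data = np.concatenate([synthetic[n_human:], human_data[:n_human]])
        mu, sigma = fit(data)
    return sigma

print("spread after 10 generations, all synthetic:", round(run(10, 0.0), 3))
print("spread after 10 generations, 30% human mix:", round(run(10, 0.3), 3))
```

With these toy parameters, the all-synthetic run's standard deviation collapses to roughly 0.05 within ten generations (the tails are gone), while the 30% human mix stabilizes near 0.7, illustrating why the authors propose keeping human-created data in the training mix.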
Key takeaways:
- Generative AI tends to select highly probable choices from the middle of the bell curve, so over time common data becomes over-represented while rare, tail-of-the-distribution data gradually disappears.
- Over multiple generations of feeding Gen AI outputs back in as new training data, the model can start to "forget" the characteristics of its original training data, producing increasingly nonsensical outputs.
- There is a need to distinguish data generated by Large Language Models (LLMs) from other data, which raises questions about the provenance of content crawled from the Internet.
- OpenAI has expressed support for a proposed California law (AB 3211) that would require watermarking on AI-generated content, which would help track and exclude LLM-generated content at scale when collecting new training data (a filtering sketch follows below).
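As a rough sketch of how watermark-based exclusion could work at crawl time: the marker scheme and `detect_watermark` function below are invented for illustration; a real detector under an AB 3211-style standard would be far more robust (for example, statistical watermarks embedded in token choices rather than literal tags).

```python
from typing import Iterable, List

HYPOTHETICAL_TAG = "\u200b[ai-generated]\u200b"  # invented invisible marker

def detect_watermark(text: str) -> bool:
    # Placeholder check; stands in for a real watermark detector.
    return HYPOTHETICAL_TAG in text

def filter_crawl(documents: Iterable[str]) -> List[str]:
    # Keep only documents that do not appear to be LLM-generated,
    # so they can be used as new training data.
    return [doc for doc in documents if not detect_watermark(doc)]

corpus = [
    "A human-written forum post about gardening.",
    "Model output." + HYPOTHETICAL_TAG,
]
print(filter_crawl(corpus))  # -> only the human-written post survives
```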