The authors also highlight the lack of transparency from AI developers about their training data and the resulting potential for litigation. They argue that the burden of avoiding copyright infringement is unfairly placed on the user, since the AI systems provide no information about the provenance of the images they produce. The authors call on AI developers to document their data sources more carefully, restrict themselves to properly licensed data, include artists' work in training data only with their consent, and compensate artists for their work.
Key takeaways:
- Large language models (LLMs) such as OpenAI's GPT-4 have been found to "memorize" and reproduce substantial chunks of text from their training sets; researchers at Google DeepMind, for example, extracted verbatim training data from ChatGPT. This raises concerns about potential copyright infringement (a minimal probe of this kind is sketched after this list).
- Generative AI systems like Midjourney V6 and OpenAI's DALL-E 3 have been found to produce near-verbatim or "plagiaristic" visual outputs based on copyrighted materials, even without direct prompts to do so.
- These findings suggest that generative AI developers may be training their systems on copyrighted materials without proper licensing or transparency, potentially exposing users to copyright infringement claims.
- The authors argue that the only ethical solution is for generative AI developers to limit training to data they have properly licensed and to be transparent about their data sources.
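To make the memorization finding concrete, here is a minimal sketch of the kind of regurgitation probe such studies use: feed a model the opening of a passage it likely saw during training and measure how much of the true continuation it reproduces verbatim. This is not the researchers' actual methodology; GPT-2 stands in for the commercial models discussed above, and the Dickens opening stands in for the copyrighted texts at issue.

```python
# Minimal memorization probe: prompt a model with the start of a known
# passage and check how much of the true continuation it emits verbatim.
# Assumptions: GPT-2 as a small local stand-in model; a public-domain
# passage as a stand-in for the copyrighted texts from the article.
from difflib import SequenceMatcher

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Split a passage the model plausibly saw in training into prefix + reference.
prefix = "It was the best of times, it was the worst of times, "
reference = "it was the age of wisdom, it was the age of foolishness"

inputs = tokenizer(prefix, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=False,  # greedy decoding: memorized text tends to surface here
    pad_token_id=tokenizer.eos_token_id,
)
# Keep only the newly generated tokens, not the echoed prefix.
continuation = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Longest verbatim overlap between the model's output and the true text.
match = SequenceMatcher(None, continuation, reference).find_longest_match(
    0, len(continuation), 0, len(reference)
)
print(f"Model continuation: {continuation!r}")
print(f"Longest verbatim overlap: {continuation[match.a:match.a + match.size]!r}")
```

A long verbatim overlap is evidence the passage was memorized rather than paraphrased, which is the crux of the copyright concern: the output is not merely "inspired by" the training data but reproduces it.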