The German nonprofit has since taken multiple versions of the dataset offline and has released filters for finding and removing illegal content. One of the companies that used LAION-5B to train its neural networks, Stability AI Ltd., has stated that version 2.0 of its Stable Diffusion model was trained on a subset of the dataset with less unsafe content. This is not the first time LAION-5B has faced scrutiny: it was previously involved in a lawsuit over the alleged use of copyrighted images and was found to contain an artist's private medical photos.
Key takeaways:
- Researchers from the Stanford Internet Observatory (SIO) have found over 1,000 child sexual abuse images in the LAION-5B AI training dataset.
- The illegal images were identified through hashing, a technique that reduces each image to a fixed fingerprint and matches it against databases of known abuse material (see the sketch after this list); the researchers have reported the offending image URLs to the relevant authorities for removal.
- LAION-5B, released by a German nonprofit in early 2022, comprises over 5 billion image-text pairs, stored as URLs and captions scraped from the web, and has been used to train multiple image generation models.
- This is not the first time the LAION-5B dataset has come under scrutiny, with previous issues including a lawsuit over the alleged use of copyrighted images and the discovery of an artist's private medical photos among the files.
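To illustrate the hash-matching approach mentioned above: each image is reduced to a digest, and that digest is compared against a list of digests of known illegal material, so flagged content can be reported without anyone needing to view it. The sketch below is a minimal, hypothetical version using MD5 digests; real detection pipelines typically rely on perceptual hashes such as PhotoDNA, whose databases are not public, and the names `KNOWN_BAD_MD5` and `flag_if_known` are placeholders, not part of any published tooling.

```python
import hashlib

# Hypothetical set of hex digests of known illegal images, as would be
# distributed by a clearinghouse (placeholder value for illustration only).
KNOWN_BAD_MD5 = {
    "d41d8cd98f00b204e9800998ecf8427e",
}


def md5_digest(image_bytes: bytes) -> str:
    """Return the MD5 hex digest (fingerprint) of raw image bytes."""
    return hashlib.md5(image_bytes).hexdigest()


def flag_if_known(image_bytes: bytes, url: str) -> bool:
    """Return True if the image's fingerprint matches a known-bad hash.

    Only the hash is compared; the matching process never requires
    storing or viewing the image content itself.
    """
    if md5_digest(image_bytes) in KNOWN_BAD_MD5:
        print(f"match: {url} flagged for reporting and removal")
        return True
    return False
```

Because a dataset like LAION-5B stores URLs rather than images, a scan of this kind would fetch each linked image, hash it, and discard the bytes, keeping only the URLs of any matches to pass to the relevant authorities.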