The release of Re-LAION-5B follows a December 2023 investigation by the Stanford Internet Observatory, which found that the original LAION-5B dataset contained at least 1,679 links to illegal images. LAION has urged all research labs and organizations still using the original LAION-5B to migrate to the new Re-LAION-5B datasets as soon as possible. The new dataset, which contains around 5.5 billion text-image pairs, can also be used by third parties to clean existing copies of LAION-5B by identifying and removing the matching illegal content.
Key takeaways:
- LAION, a German research organization, has released a new dataset named Re-LAION-5B, which it claims has been thoroughly cleaned of links to suspected child sexual abuse material (CSAM). The cleaning followed recommendations from several organizations, including the Internet Watch Foundation and Human Rights Watch.
- The new dataset is a re-release of the earlier LAION-5B dataset and is available in two versions: Re-LAION-5B Research and Re-LAION-5B Research-Safe. Both versions were filtered to remove thousands of links to known and likely CSAM.
- The release of Re-LAION-5B comes after an investigation by the Stanford Internet Observatory found that the original LAION-5B dataset contained at least 1,679 links to illegal images. The report recommended that models trained on LAION-5B be deprecated and that their distribution cease where feasible.
- The new Re-LAION-5B dataset, which contains around 5.5 billion text-image pairs, can be used by third parties to clean existing copies of LAION-5B by identifying and removing the matching illegal content. LAION stresses that its datasets are intended for research purposes, not commercial ones.