Exclusive access for LLM companies to largest Chinese non-fiction book collection in the world

Nov 05, 2023 - annas-blog.org
Anna's Archive, a digital library, has acquired a unique collection of 7.5 million Chinese non-fiction books, larger than Library Genesis's non-fiction collection. It is seeking a company or institution to handle OCR and text extraction for the collection, and in exchange is offering exclusive early access to it for one year. The collection is expected to be useful for training LLMs, and the extracted text would enable full-text search of the books for users.

The collection was sourced from Duxiu, a massive database of scanned books created by the SuperStar Digital Library Group, and was shared with Anna's Archive by a volunteer. The collection, which is difficult to obtain in bulk, is larger than Library Genesis's non-fiction collection and totals about 350TB in its current form. Anna's Archive is open to other proposals and ideas regarding this collection.

Key takeaways:

  • Anna’s Archive has acquired a unique collection of 7.5 million Chinese non-fiction books (about 350TB), which is larger than Library Genesis's non-fiction collection.
  • The Archive is willing to give an LLM company exclusive early access to this collection for 1 year in exchange for high-quality OCR and text extraction.
  • The collection was obtained from Duxiu, a massive database of scanned books created by the SuperStar Digital Library Group, and was shared with Anna's Archive by a volunteer for long-term preservation.
  • Anna’s Archive is open to other proposals and ideas for collaboration and encourages interested parties to contact them.