The author also discusses decompressing the compressed text. The decompression function splits the compressed text into sections and, for each section, either regenerates the missing text with the model or appends the literal text directly. The author acknowledges that this method works better on data the model has been trained on, and raises questions about the practicality of training a model specifically for compression, whether the method could identify training data, and whether it could be extended to other data types such as images.
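As a minimal sketch of what that decompression loop might look like, assume a compressed format that interleaves literal text with markers such as `⦃N⦄`, meaning "have the model generate the next N words." The marker syntax and the `generate` helper are assumptions for illustration, not the author's actual format:

```python
import re

def generate(context: str, n_words: int) -> str:
    """Stand-in for a greedy LLM continuation of `context`.

    A real implementation would query a language model here;
    this sketch only defines the interface."""
    raise NotImplementedError("plug in an actual model")

def decompress(compressed: str) -> str:
    """Rebuild the original text from literal and generated sections."""
    output = []
    # Split on markers like ⦃12⦄, keeping them via the capture group.
    for section in re.split(r"(⦃\d+⦄)", compressed):
        match = re.fullmatch(r"⦃(\d+)⦄", section)
        if match:
            # Missing section: regenerate it from everything rebuilt so far.
            output.append(generate("".join(output), int(match.group(1))))
        else:
            # Literal section: append the text directly.
            output.append(section)
    return "".join(output)
```

Because the generated sections come from the model's own predictions, the reconstruction is only exact when the model reliably reproduces those spans, which is why the method favors text the model has seen in training.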
Key takeaways:
- The author explores the possibility of extracting training text from large language models (LLMs) and using these models to reproduce text they have not been directly trained on.
- The author developed a solution that includes functions to load documents, generate text, compress text, and decompress text (a sketch of the compression side follows this list).
- The method was tested on the first chapter of *Alice's Adventures in Wonderland* and achieved significant compression, reducing the text from 11,994 characters to 986 (roughly a 12x reduction).
- The author raises questions about the practicality of training a model for the purpose of compression, the possibility of identifying training data through this method, the potential performance of different models, and the extension of this method to other data types like images.
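To make the compression side concrete, here is a minimal word-level sketch of how such a scheme could work. This is not the author's code: the `predict_next` helper, the `⦃N⦄` marker syntax, the `min_run` threshold, and the whitespace-splitting simplification are all assumptions (the author presumably worked at the model's token level):

```python
def predict_next(context: str) -> str:
    """Stand-in for the model's single most likely next word.

    A real implementation would query a language model here."""
    raise NotImplementedError("plug in an actual model")

def compress(text: str, min_run: int = 4) -> str:
    """Replace word runs the model can reproduce with ⦃N⦄ markers."""
    words = text.split()  # simplification: whitespace words, not model tokens
    out, i = [], 0
    while i < len(words):
        # Count how many consecutive words the model predicts correctly.
        run, context = 0, " ".join(words[:i])
        while i + run < len(words) and predict_next(context) == words[i + run]:
            context += (" " if context else "") + words[i + run]
            run += 1
        if run >= min_run:
            out.append(f"⦃{run}⦄")  # the model can regenerate this span
            i += run
        else:
            out.append(words[i])    # too short to be worth a marker: keep it
            i += 1
    return " ".join(out)
```

A matching decompressor, like the sketch earlier, would then replace each `⦃N⦄` marker by asking the same model for N continuation words, so compressor and decompressor must share the model and its decoding settings.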