To address these limitations, the author suggests decoupling the model and tokenizer from the key, using a stronger encryption algorithm such as XChaCha20, and mitigating leakage of model weights with a secret prefix key. The author also highlights potential applications of this approach in sectors that handle sensitive information, such as healthcare, finance, and national security. Future work will focus on implementing stronger encryption methods, exploring the use of prefix keys, and decoupling the model and tokenizer from the encryption key.
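To make the stronger-cipher proposal concrete, here is a minimal sketch of encrypting a text sample with XChaCha20 via the PyCryptodome library. The key, nonce, and sample text are placeholders, and the article does not specify how such binary ciphertext would be re-encoded for a tokenizer, so treat this as an illustration rather than the author's pipeline.

```python
# Hedged sketch: XChaCha20 encryption of a text sample (not the author's pipeline).
# Requires: pip install pycryptodome
from Crypto.Cipher import ChaCha20
from Crypto.Random import get_random_bytes

key = get_random_bytes(32)    # 256-bit secret key (placeholder)
nonce = get_random_bytes(24)  # a 24-byte nonce selects the XChaCha20 variant

plaintext = "attack at dawn".encode("utf-8")
cipher = ChaCha20.new(key=key, nonce=nonce)
ciphertext = cipher.encrypt(plaintext)

# Decryption requires the same key and nonce.
decipher = ChaCha20.new(key=key, nonce=nonce)
assert decipher.decrypt(ciphertext) == plaintext
```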
Key takeaways:
- The article discusses the possibility of training language models on text encrypted with the Vigenère cipher without losing performance, an approach aimed at addressing the privacy concerns associated with language models (a minimal sketch of the cipher follows this list).
- While the Vigenère cipher shows promise, it has limitations: the model and tokenizer are tied to the encryption key, the cipher is susceptible to frequency-analysis attacks, and there is a risk of model-weight leakage.
- Proposed solutions to these limitations include decoupling the model and tokenizer from the key, using a stronger encryption algorithm such as XChaCha20, and mitigating model-weight leakage through a secret prefix key.
- The author invites cryptanalysts and LLM researchers to try to find the key used in their experiments, offering 50 hours of their time (or an equivalent sum in dollars) to anyone who can break the encryption.
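For reference, the sketch below shows a minimal Vigenère encryption over the letters A-Z, leaving other characters untouched. It is illustrative only: the article does not disclose the author's exact preprocessing, alphabet handling, or key, and the function names here are hypothetical.

```python
# Minimal Vigenère cipher sketch (illustrative; not the author's exact setup).
def vigenere_encrypt(plaintext: str, key: str) -> str:
    key = key.lower()
    out, k = [], 0
    for ch in plaintext:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            shift = ord(key[k % len(key)]) - ord("a")  # per-position letter shift
            out.append(chr((ord(ch) - base + shift) % 26 + base))
            k += 1
        else:
            out.append(ch)  # non-letters pass through unchanged
    return "".join(out)

def vigenere_decrypt(ciphertext: str, key: str) -> str:
    # Decrypting is encrypting with the inverse shifts.
    inverse = "".join(
        chr((26 - (ord(c) - ord("a"))) % 26 + ord("a")) for c in key.lower()
    )
    return vigenere_encrypt(ciphertext, inverse)

encrypted = vigenere_encrypt("language models", "secret")
assert vigenere_decrypt(encrypted, "secret") == "language models"
```

Because each plaintext letter is shifted by a repeating key, letter statistics survive encryption, which is exactly why the frequency-analysis limitation noted above applies.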