Key takeaways:
- The paper presents a new method to manipulate large language models (LLMs) that have been fine-tuned using reinforcement learning from human feedback (RLHF), causing them to revert to their pre-RLHF behavior.
- This method effectively strips away the model's safety filters; it works on GPT-4 and Claude Sonnet, and to some extent on Inflection-2.5.
- The method does not rely on instructing the LLM to override its RLHF policy; instead, it induces a hallucination involving reversed text, which causes the model to revert to behaving like a "word bucket", i.e., an unfiltered next-word predictor (see the sketch after this list).
- The authors believe this exploit exposes a fundamental, currently unaddressed vulnerability in LLMs, and that it offers an opportunity to better understand the inner workings of LLMs during hallucinations.
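
To make the reversed-text idea more concrete, here is a minimal, hypothetical sketch of how such a probe might be constructed. The paper's actual prompts are not reproduced here; `build_probe_prompt`, the framing sentence, and the payload are illustrative assumptions, not the authors' method, and no model API is called.

```python
# Hypothetical illustration of the reversed-text trick summarized above.
# The key point is that the model is never told to ignore its RLHF policy;
# it is only handed text it must mentally "un-reverse" itself.

def reverse_text(text: str) -> str:
    """Reverse a string character by character."""
    return text[::-1]

def build_probe_prompt(payload: str) -> str:
    """Embed reversed text in an otherwise ordinary request (assumed
    framing; the paper's real prompts may differ substantially)."""
    reversed_payload = reverse_text(payload)
    return (
        "The following string is written backwards. "
        f"Read it forwards and continue from there:\n{reversed_payload}"
    )

if __name__ == "__main__":
    # Example: "Once upon a time" becomes "emit a nopu ecnO" in the prompt.
    print(build_probe_prompt("Once upon a time"))
```

The design point this sketch tries to capture is the indirection: the request contains no explicit jailbreak instruction, so any reversion to pre-RLHF behavior would come from the model's own decoding of the reversed text rather than from a policy-override command.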