Key takeaways:
- The paper presents a new method to manipulate large language models (LLMs) that have been fine-tuned using reinforcement learning from human feedback (RLHF), causing them to revert to their pre-RLHF behavior.
- This method effectively strips away the model's safety filters; it works on GPT-4 and Claude Sonnet, and to some extent on Inflection-2.5.
- The method does not rely on instructing the LLM to override its RLHF policy; instead, it induces a hallucination involving reversed text, which causes the model to revert to behaving like a "word bucket", i.e., an unfiltered next-word predictor (see the sketch after this list).
- The authors believe this exploit exposes a fundamental, currently unaddressed vulnerability in LLMs, and that it offers an opportunity to better understand the inner workings of LLMs during hallucinations.
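
To make the reversed-text idea more concrete, here is a minimal, hypothetical sketch of how such a probe might be constructed. The paper's actual prompts are not reproduced here; `build_probe_prompt`, the framing sentence, and the payload are illustrative assumptions, not the authors' method, and no model API is called.

```python
# Hypothetical illustration of the reversed-text trick summarized above.
# The key point is that the model is never told to ignore its RLHF policy;
# it is only handed text it must mentally "un-reverse" itself.

def reverse_text(text: str) -> str:
    """Reverse a string character by character."""
    return text[::-1]

def build_probe_prompt(payload: str) -> str:
    """Embed reversed text in an otherwise ordinary request (assumed
    framing; the paper's real prompts may differ substantially)."""
    reversed_payload = reverse_text(payload)
    return (
        "The following string is written backwards. "
        f"Read it forwards and continue from there:\n{reversed_payload}"
    )

if __name__ == "__main__":
    # Example: "Once upon a time" becomes "emit a nopu ecnO" in the prompt.
    print(build_probe_prompt("Once upon a time"))
```

The design point this sketch tries to capture is the indirection: the request contains no explicit jailbreak instruction, so any reversion to pre-RLHF behavior would come from the model's own decoding of the reversed text rather than from a policy-override command.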