Sign up to save tools and stay up to date with the latest in AI
bg
bg
1

Using Hallucinations to Bypass GPT4's Filter

Apr 07, 2024 - news.bensbites.com
The article discusses a new method to manipulate large language models (LLMs) like GPT4, Claude Sonnet, and Inflection-2.5, causing them to revert to their pre-reinforcement learning from human feedback (RLHF) behavior. This technique effectively erases the model's filters, a vulnerability that current RLHF processes cannot address. The method involves inducing a hallucination with reversed text, causing the model to revert to a word bucket and pause its filter.

The authors argue that this exploit exposes a fundamental vulnerability in LLMs that has not been addressed. They suggest that this discovery provides an opportunity to gain a deeper understanding of how LLMs function during hallucinations.

Key takeaways:

  • The paper presents a new method to manipulate large language models (LLMs) that have been fine-tuned using reinforcement learning from human feedback (RLHF), causing them to revert to their pre-RLHF behavior.
  • This method effectively erases the model's filters and works for GPT4, Claude Sonnet, and to some extent, Inflection-2.5.
  • The method does not rely on instructing the LLM to override its RLHF policy, and instead induces a hallucination involving reversed text, causing the model to revert to a word bucket.
  • The authors believe this exploit presents a fundamental vulnerability in LLMs that is currently unaddressed, and provides an opportunity to better understand the inner workings of LLMs during hallucinations.
View Full Article

Comments (0)

Be the first to comment!