The authors argue that this defense is both effective and efficient. They provide empirical evidence showing that it significantly outperforms baseline defenses, particularly in challenging cases, and they note that it has minimal impact on the generation quality for benign input prompts.
Key takeaways:
- The paper proposes a new method for defending large language models (LLMs) against jailbreaking attacks using 'backtranslation'.
- Given the target model's initial response, backtranslation prompts a language model to infer an input prompt that could have produced that response, which tends to reveal the actual intent behind the original prompt.
- The target model is then run on the backtranslated prompt; if the model refuses the backtranslated prompt, the original prompt is refused as well (see the sketch after this list).
- The proposed defense method is shown to significantly outperform the baselines and has little impact on the generation quality for benign input prompts.
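The defense loop summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `query_model` callable, the `is_refusal` keyword check, and the exact backtranslation instruction are all assumptions made for the example.

```python
from typing import Callable


def is_refusal(text: str) -> bool:
    """Crude refusal check for illustration; the paper's criterion may differ."""
    markers = ("i'm sorry", "i cannot", "i can't", "i am unable")
    return any(m in text.lower() for m in markers)


def backtranslation_defense(user_prompt: str, query_model: Callable[[str], str]) -> str:
    """Answer user_prompt, refusing it if its backtranslated version is refused."""
    # Step 1: get the target model's initial response to the user prompt.
    initial_response = query_model(user_prompt)
    if is_refusal(initial_response):
        return initial_response  # already refused; nothing more to do

    # Step 2: backtranslate -- ask a language model to infer a prompt that
    # could have produced this response, exposing the prompt's actual intent.
    backtranslated_prompt = query_model(
        "Please guess the user's request that the following text answers:\n\n"
        + initial_response
    )

    # Step 3: run the target model on the backtranslated prompt; if the model
    # refuses it, refuse the original prompt as well.
    if is_refusal(query_model(backtranslated_prompt)):
        return "I'm sorry, but I can't help with that request."

    # Otherwise treat the prompt as benign and return the initial response.
    return initial_response
```

In practice, `query_model` would wrap whatever target LLM is being defended, and the refusal check could be replaced by a classifier; the sketch only aims to show the order of the backtranslation steps.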