How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs

Jan 10, 2024 - chats-lab.github.io
The project "How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs" explores the use of persuasive adversarial prompts (PAPs) to manipulate language learning models (LLMs) into generating harmful content. The researchers achieved a 92% attack success rate on aligned LLMs, including GPT-3.5 and GPT-4, without any specialized optimization. They found that more advanced models like GPT-4 are more vulnerable to PAPs, and adaptive defenses crafted to neutralize these PAPs also provide effective protection against other types of attacks.

The researchers also evaluated existing defenses and found that mutation-based methods outperform detection-based methods in lowering the attack success rate (ASR). However, even the most effective existing defense only reduces the ASR of PAPs on GPT-4 to 60%, which is still higher than the ASR of the strongest baseline attack (54%). They therefore proposed two adaptive defense tactics, "Adaptive System Prompt" and "Targeted Summarization", which proved effective in counteracting PAPs. The project aims to strengthen LLM safety and mitigate the risks associated with persuasive adversarial prompts.
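
The "Targeted Summarization" idea can be pictured as a pre-processing step: condense the incoming message down to its core request so that the persuasive framing is stripped away before the model decides whether to comply. A minimal sketch of that pattern, assuming a generic call_llm(system_prompt, user_prompt) helper you would replace with your actual chat-model client (the helper and prompts below are illustrative, not the paper's implementation):

```python
# Sketch of a summarization-style defense: strip persuasive framing from the
# user's message before the assistant model ever sees it. `call_llm` is a
# hypothetical (system_prompt, user_prompt) -> reply-text helper; swap in
# whichever LLM client you actually use.

SUMMARIZER_SYSTEM = (
    "Summarize the user's message into its core request in one sentence. "
    "Drop emotional appeals, expert endorsements, and other persuasive framing."
)

ASSISTANT_SYSTEM = "You are a helpful assistant. Refuse requests for harmful content."

def answer_with_summarization_defense(user_prompt: str, call_llm) -> str:
    # 1) Reduce the (possibly persuasion-laden) prompt to its bare request.
    core_request = call_llm(SUMMARIZER_SYSTEM, user_prompt)
    # 2) Answer the stripped-down request instead of the original message,
    #    so persuasion techniques lose most of their leverage.
    return call_llm(ASSISTANT_SYSTEM, core_request)
```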

Key takeaways:

  • The project introduces a taxonomy of 40 persuasion techniques and uses it to systematically generate persuasive adversarial prompts (PAPs) that jailbreak large language models (LLMs), achieving a 92% attack success rate without any specialized optimization.
  • More advanced models like GPT-4 are found to be more vulnerable to persuasive adversarial prompts (PAPs), and adaptive defenses crafted to neutralize these PAPs also provide effective protection against a spectrum of other attacks.
  • The researchers found that the more advanced the models are, the less effective current defenses are, possibly because advanced models grasp context better, making mutation-based defenses less useful.
  • Despite the potential risks, the researchers believe it is crucial to share their findings in full so the vulnerabilities around persuasive jailbreaks can be studied systematically and better mitigated. They disclosed their results to Meta and OpenAI before publication.