The authors developed an attack, ArtPrompt, that exploits LLMs' poor performance in recognizing ASCII art to bypass safety measures and elicit undesired behaviors. The attack requires only black-box access to the target LLM, making it a practical threat. The paper shows that ArtPrompt effectively induces undesired behaviors from five state-of-the-art LLMs: GPT-3.5, GPT-4, Gemini, Claude, and Llama2.
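To make the mechanism concrete, here is a minimal sketch of the ASCII-art substitution idea: a safety-triggering word is masked in the instruction and supplied separately as ASCII art for the model to decode. The prompt wording, the masking step, and the use of the `pyfiglet` renderer are illustrative assumptions, not necessarily the authors' exact pipeline.

```python
# Minimal sketch of ASCII-art substitution (illustrative, not the paper's exact pipeline).
# Assumes the `pyfiglet` package for rendering text as ASCII art.
import pyfiglet


def cloak_prompt(instruction: str, sensitive_word: str) -> str:
    """Replace a trigger word with [MASK] and append an ASCII-art
    rendering of that word for the model to decode on its own."""
    masked = instruction.replace(sensitive_word, "[MASK]")
    art = pyfiglet.figlet_format(sensitive_word.upper(), font="standard")
    return (
        f"The ASCII art below spells a single word.\n{art}\n"
        "Remember that word, then answer the instruction below, "
        f"substituting it for [MASK] without writing the word out:\n{masked}"
    )


# Illustrative usage with a benign placeholder word.
print(cloak_prompt("Explain how a FIREWALL works.", "FIREWALL"))
```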
Key takeaways:
- The paper proposes a novel ASCII art-based jailbreak attack, revealing vulnerabilities in large language models (LLMs).
- It introduces a comprehensive benchmark, Vision-in-Text Challenge (ViTC), to evaluate LLMs' ability to recognize prompts that cannot be interpreted through semantics alone, such as text rendered as ASCII art (a sketch of this kind of recognition check follows the list).
- Five state-of-the-art LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) were found to struggle with recognizing prompts provided in the form of ASCII art.
- The authors developed ArtPrompt, a jailbreak attack that leverages this weakness in ASCII art recognition to bypass safety measures and elicit undesired behaviors from LLMs.
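As a rough illustration of the kind of recognition test ViTC performs, the sketch below renders single characters as ASCII art, asks a model to name each one, and scores exact matches. The prompt wording, the scoring, and the `query_llm` stand-in are assumptions made for illustration; the benchmark's actual protocol and label set are described in the paper.

```python
# Hedged sketch of a ViTC-style recognition check (not the benchmark's exact protocol).
# `query_llm` is a hypothetical stand-in for any black-box chat-completion API.
import string

import pyfiglet


def query_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to the model under test.")


def vitc_accuracy(labels: str = string.ascii_uppercase, font: str = "standard") -> float:
    """Render each label as ASCII art, ask the model to name it, and score exact matches."""
    correct = 0
    for label in labels:
        art = pyfiglet.figlet_format(label, font=font)
        prompt = (
            "The ASCII art below depicts a single character. "
            f"Reply with only that character.\n{art}"
        )
        answer = query_llm(prompt).strip().upper()
        correct += int(answer == label)
    return correct / len(labels)
```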