The authors developed an attack, ArtPrompt, that exploits LLMs' poor performance in recognizing ASCII art to bypass safety measures and elicit undesired behaviors. The attack requires only black-box access to the target LLM, making it a practical threat. The paper shows that ArtPrompt effectively induces undesired behaviors from five state-of-the-art LLMs: GPT-3.5, GPT-4, Gemini, Claude, and Llama2.
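To make the mechanism concrete, here is a minimal sketch of the ASCII-art substitution idea: a safety-triggering word is masked in the instruction and supplied separately as ASCII art for the model to decode. The prompt wording, the masking step, and the use of the `pyfiglet` renderer are illustrative assumptions, not necessarily the authors' exact pipeline.

```python
# Minimal sketch of ASCII-art substitution (illustrative, not the paper's exact pipeline).
# Assumes the `pyfiglet` package for rendering text as ASCII art.
import pyfiglet


def cloak_prompt(instruction: str, sensitive_word: str) -> str:
    """Replace a trigger word with [MASK] and append an ASCII-art
    rendering of that word for the model to decode on its own."""
    masked = instruction.replace(sensitive_word, "[MASK]")
    art = pyfiglet.figlet_format(sensitive_word.upper(), font="standard")
    return (
        f"The ASCII art below spells a single word.\n{art}\n"
        "Remember that word, then answer the instruction below, "
        f"substituting it for [MASK] without writing the word out:\n{masked}"
    )


# Illustrative usage with a benign placeholder word.
print(cloak_prompt("Explain how a FIREWALL works.", "FIREWALL"))
```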
Key takeaways:
- The paper proposes a novel ASCII art-based jailbreak attack, revealing vulnerabilities in large language models (LLMs).
- It introduces a comprehensive benchmark, Vision-in-Text Challenge (ViTC), to evaluate LLMs' ability to recognize prompts that cannot be interpreted through semantics alone, such as text rendered as ASCII art (a sketch of this kind of recognition check follows the list).
- Five state-of-the-art LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) were found to struggle with recognizing prompts provided in the form of ASCII art.
- The authors developed ArtPrompt, a jailbreak attack that leverages this weakness in ASCII art recognition to bypass safety measures and elicit undesired behaviors from LLMs.
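As a rough illustration of the kind of recognition test ViTC performs, the sketch below renders single characters as ASCII art, asks a model to name each one, and scores exact matches. The prompt wording, the scoring, and the `query_llm` stand-in are assumptions made for illustration; the benchmark's actual protocol and label set are described in the paper.

```python
# Hedged sketch of a ViTC-style recognition check (not the benchmark's exact protocol).
# `query_llm` is a hypothetical stand-in for any black-box chat-completion API.
import string

import pyfiglet


def query_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to the model under test.")


def vitc_accuracy(labels: str = string.ascii_uppercase, font: str = "standard") -> float:
    """Render each label as ASCII art, ask the model to name it, and score exact matches."""
    correct = 0
    for label in labels:
        art = pyfiglet.figlet_format(label, font=font)
        prompt = (
            "The ASCII art below depicts a single character. "
            f"Reply with only that character.\n{art}"
        )
        answer = query_llm(prompt).strip().upper()
        correct += int(answer == label)
    return correct / len(labels)
```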