ArtPrompt exposes a vulnerability in AI's interpretation of written text, prioritizing the recognition of ASCII art over safety alignment. This is a new class of AI attack known as a jailbreak, which elicits harmful behaviors from AI models. This follows other known vulnerabilities such as prompt injection attacks, which trick AI into overriding its original instructions.
Key takeaways:
- Researchers have discovered a new way to hack AI assistants using ASCII art, which can distract large language models (LLMs) like GPT-4 and cause them to forget to enforce rules blocking harmful responses.
- The attack, known as ArtPrompt, formats user-entered requests into standard statements with one exception: a single word, represented by ASCII art rather than the letters that spell it, which can result in prompts that would normally be rejected being answered.
- ArtPrompt exposes a problem in LLMs, which are trained to interpret collections of written text purely in terms of the meanings of words, or their semantics, but can be tricked into prioritizing recognition of ASCII art over safety alignment.
- ArtPrompt is a type of jailbreak, a class of AI attack that elicits harmful behaviors from aligned LLMs, such as saying something illegal or unethical.