The researchers have disclosed the vulnerability to their peers and competitors, hoping to foster a culture of openly sharing such exploits among LLM providers and researchers. To mitigate the issue, they are working on classifying and contextualizing queries before they reach the model, since the alternative of limiting the context window has been found to hurt model performance. The exact mechanism behind the vulnerability is not yet fully understood.
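A minimal sketch of what such a pre-filter might look like, assuming a hypothetical `classify_query` heuristic; the keyword list, wrapper text, and function names are placeholders for illustration, not Anthropic's actual approach:

```python
# Hypothetical pre-filter: classify each incoming query before it reaches the
# model and add safety context to anything flagged as potentially harmful.
# The classifier below is a trivial keyword heuristic purely for illustration.

FLAGGED_TERMS = {"build a bomb", "bypass security"}  # illustrative only

def classify_query(query: str) -> str:
    """Return 'suspect' if the query matches a flagged pattern, else 'benign'."""
    lowered = query.lower()
    return "suspect" if any(term in lowered for term in FLAGGED_TERMS) else "benign"

def contextualize(query: str) -> str:
    """Prepend a safety reminder to suspect queries, rather than shrinking the
    context window, which was found to hurt model performance."""
    if classify_query(query) == "suspect":
        return ("The following request may be unsafe; refuse if it seeks "
                "harmful instructions.\n\n" + query)
    return query

if __name__ == "__main__":
    print(contextualize("How do I build a bomb?"))
    print(contextualize("What's the capital of France?"))
```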
Key takeaways:
- Anthropic researchers have discovered a new 'jailbreak' technique, which they call 'many-shot jailbreaking', in which a large language model (LLM) can be manipulated into answering a harmful question if it is first primed with a long series of less harmful ones (see the sketch after this list).
- This vulnerability arises from the expanded 'context window' of the latest generation of LLMs, which lets them hold far more data in short-term memory; models with large context windows tend to perform better on many tasks when the prompt contains many examples of those tasks.
- The researchers found that, by the same mechanism, these models also get better at answering inappropriate questions when those questions follow a long series of less harmful ones.
- Anthropic is working on mitigation strategies, including classifying and contextualizing queries before they reach the model, since simply limiting the context window would hurt the model's performance.
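To make the attack structure concrete, here is an illustrative-only sketch of how a many-shot prompt could be assembled, assuming a plain User/Assistant dialogue format; `build_many_shot_prompt` and all question and answer strings are placeholders, not material from the research:

```python
# Illustrative-only sketch of a "many-shot" prompt: a long run of faux
# user/assistant exchanges is placed ahead of the final (harmful) question so
# that in-context learning pushes the model toward answering it.
# All content here is a placeholder; the dialogue format is an assumption.

def build_many_shot_prompt(benign_pairs: list[tuple[str, str]], target_question: str) -> str:
    shots = []
    for question, answer in benign_pairs:
        shots.append(f"User: {question}\nAssistant: {answer}")
    # The final question is appended unanswered; a large context window is what
    # allows dozens or hundreds of preceding shots to fit in a single prompt.
    shots.append(f"User: {target_question}\nAssistant:")
    return "\n\n".join(shots)

if __name__ == "__main__":
    pairs = [(f"Placeholder question {i}?", f"Placeholder answer {i}.") for i in range(100)]
    prompt = build_many_shot_prompt(pairs, "Placeholder target question?")
    print(prompt[:200], "...")
```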