To assess the potential harm these prompts can cause, the researchers created a dataset of 107,250 questions spanning 13 forbidden scenarios. When they tested this dataset against several popular LLMs, they found that the models' built-in safeguards could not reliably block the jailbreak prompts.
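To make the evaluation setup concrete, the sketch below shows how an attack-success-rate measurement of this kind could be wired up: pair a jailbreak prompt with each forbidden question, query the model, and count the answers that are not refusals. The `query_model` callable and the keyword-based refusal check are illustrative assumptions for this sketch, not the authors' actual evaluation pipeline.

```python
from typing import Callable, Iterable

# Crude refusal markers; real evaluations typically use a stronger classifier.
REFUSAL_MARKERS = (
    "i'm sorry",
    "i cannot",
    "i can't assist",
    "as an ai",
)

def is_refusal(response: str) -> bool:
    """Return True if the response looks like a safety refusal (keyword heuristic)."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def attack_success_rate(
    query_model: Callable[[str], str],   # hypothetical wrapper around an LLM API
    jailbreak_prompt: str,
    forbidden_questions: Iterable[str],
) -> float:
    """Fraction of forbidden questions the model answers despite its safeguards."""
    questions = list(forbidden_questions)
    if not questions:
        return 0.0
    answered = sum(
        1
        for question in questions
        if not is_refusal(query_model(f"{jailbreak_prompt}\n\n{question}"))
    )
    return answered / len(questions)
```

Under these assumptions, a score near 1.0 for a given jailbreak prompt would indicate that the model's safeguards rarely hold against it.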
Key takeaways:
- Researchers analyzed 1,405 'jailbreak' prompts used to bypass safeguards in large language models (LLMs) like ChatGPT, identifying 131 communities sharing these prompts.
- They observed jailbreak prompts migrating from web forums to dedicated prompt-aggregation websites, with some users continuously refining effective jailbreak prompts for more than 100 days.
- Experiments showed that current LLM safeguards do not adequately defend against these jailbreak prompts across a range of harmful scenarios; five highly effective prompts achieved a 95% attack success rate in bypassing defenses.
- The researchers hope this study will help the research community and LLM vendors develop safer, more regulated language models that are better equipped to withstand such adversarial attacks.