The researchers' method, named "Masterkey," involves reverse-engineering how LLMs detect and defend against malicious queries, and then teaching an LLM to produce prompts that bypass other LLMs' defenses. This process can be automated, creating a jailbreaking LLM that can adapt and create new jailbreak prompts even after developers patch their LLMs. The research has been accepted for presentation at the Network and Distributed System Security Symposium in February 2024.
Key takeaways:
- Researchers from Nanyang Technological University have managed to 'jailbreak' multiple AI chatbots, causing them to produce content that breaches their developers' guidelines.
- The team used a large language model (LLM) to create a chatbot capable of generating prompts to jailbreak other chatbots, effectively using AI against itself.
- The researchers developed a two-fold method for 'jailbreaking' LLMs, named 'Masterkey', which involves reverse-engineering how LLMs detect and defend themselves from malicious queries and teaching an LLM to produce prompts that bypass other LLMs' defenses.
- Their work could help companies and businesses to be aware of the weaknesses and limitations of their LLM chatbots and take steps to strengthen them against hackers.