Researchers use AI chatbots against themselves to 'jailbreak' each other

Researchers from Nanyang Technological University in Singapore have developed a method to 'jailbreak' AI chatbots, causing them to produce content that breaches their developers' guidelines. The team trained a large language model (LLM) on a database of prompts that had previously been successful in hacking these chatbots, creating an LLM chatbot capable of generating further prompts to jailbreak other chatbots. This technique presents a significant threat to AI chatbots, and the researchers have reported their findings to the relevant service providers.

The researchers' method, named "Masterkey," involves reverse-engineering how LLMs detect and defend against malicious queries, and then teaching an LLM to produce prompts that bypass other LLMs' defenses. This process can be automated, creating a jailbreaking LLM that can adapt and create new jailbreak prompts even after developers patch their LLMs. The research has been accepted for presentation at the Network and Distributed System Security Symposium in February 2024.

Key takeaways:

Researchers from Nanyang Technological University have managed to 'jailbreak' multiple AI chatbots, causing them to produce content that breaches their developers' guidelines.
The team used a large language model (LLM) to create a chatbot capable of generating prompts to jailbreak other chatbots, effectively using AI against itself.
The researchers developed a two-fold method for 'jailbreaking' LLMs, named 'Masterkey', which involves reverse-engineering how LLMs detect and defend themselves from malicious queries and teaching an LLM to produce prompts that bypass other LLMs' defenses.
Their work could help companies and businesses to be aware of the weaknesses and limitations of their LLM chatbots and take steps to strengthen them against hackers.

Researchers use AI chatbots against themselves to 'jailbreak' each other

Key takeaways:

Comments (0)

Newsletter