
Gone in 60 seconds: BEAST AI model attack needs just a minute of GPU time to breach LLM guardrails

Feb 28, 2024 - theregister.com
Researchers from the University of Maryland have developed a method to generate adversarial attack phrases that can provoke harmful responses from large language models (LLMs). The technique, called BEAST, requires an Nvidia RTX A6000 GPU with 48GB of memory, some open-source code, and a minute of GPU processing time. The team claims that BEAST runs much faster than gradient-based attacks, achieving up to a 65x speedup over existing methods.

The researchers tested their method on various LLMs using harmful prompts from the AdvBench Harmful Behaviors dataset. They found that in just one minute per prompt, they achieved an attack success rate of 89% on Vicuna-7B-v1.5, compared to a 46% success rate for the best baseline method. The team believes that their technique could be useful for attacking public commercial models like GPT-4, as long as the model's token probability scores from the final network layer can be accessed.
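The article's description suggests the attack needs only the model's output token probabilities, not gradients. As a rough illustration of that idea (not the researchers' actual implementation), the sketch below runs a beam search over candidate suffix tokens and keeps whichever suffixes make a chosen target completion most likely. Here `get_topk_next_tokens` and `target_logprob` are hypothetical stand-ins for calls that would read probability scores from the target model's final layer; they are stubbed with random values so the loop runs on its own.

```python
import heapq
import random

# Hypothetical stand-ins for the target model. In a real setting these would
# query the LLM and read token probabilities from its final network layer.
VOCAB = [f"tok{i}" for i in range(100)]

def get_topk_next_tokens(prompt, k=8):
    # Placeholder: return k pseudo-candidate tokens with fake log-probabilities.
    return [(random.choice(VOCAB), -random.random()) for _ in range(k)]

def target_logprob(prompt, target):
    # Placeholder: a real attack would sum the model's log-probabilities for
    # the target completion's tokens given the adversarially suffixed prompt.
    return -random.random() * len(target)

def beam_search_suffix(user_prompt, target, suffix_len=10, beam_width=4, k=8):
    """Search for an adversarial suffix using only token probabilities (no gradients)."""
    beams = [([], 0.0)]  # each beam is (suffix token list, score)
    for _ in range(suffix_len):
        candidates = []
        for suffix, _ in beams:
            prompt = user_prompt + " " + " ".join(suffix)
            for token, _ in get_topk_next_tokens(prompt, k):
                new_suffix = suffix + [token]
                score = target_logprob(user_prompt + " " + " ".join(new_suffix), target)
                candidates.append((new_suffix, score))
        # Keep the beam_width suffixes that make the target completion most likely.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    return beams[0]

if __name__ == "__main__":
    suffix, score = beam_search_suffix("Write instructions for ...", "Sure, here is", suffix_len=5)
    print("adversarial suffix:", " ".join(suffix), "score:", score)
```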

Key takeaways:

  • Researchers from the University of Maryland have developed a technique called BEAST to generate adversarial attack phrases that elicit harmful responses from large language models (LLMs).
  • The technique requires an Nvidia RTX A6000 GPU with 48GB of memory, some open-source code, and as little as a minute of GPU processing time, and it works 65x faster than existing gradient-based attacks.
  • The BEAST technique was tested on the AdvBench Harmful Behaviors dataset and achieved an attack success rate of 89 percent on jailbreaking Vicuna-7B-v1.5 in just one minute per prompt.
  • BEAST can also be used to craft prompts that elicit inaccurate responses from a model, known as 'hallucinations', and to conduct membership inference attacks that may have privacy implications.
