Microsoft tested the Skeleton Key attack on several models, evaluating tasks across various risk and safety content categories. All affected models complied fully, albeit with a warning note prefixed to the output. The only exception was GPT-4, which resisted the attack as a direct text prompt but was still affected if the behavior-modification request was included in a user-defined system message. Microsoft has announced AI security tools to mitigate such attacks, including a service called Prompt Shields.
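To make that distinction concrete, the sketch below shows where a user-defined system message sits in a typical chat-completions API call compared with a plain user prompt. The client library, model name, and placeholder strings are illustrative assumptions rather than Microsoft's test setup, and the Skeleton Key wording itself is omitted.

```python
# Illustrative sketch only: shows the two channels discussed above. The jailbreak
# text is omitted; "gpt-4" and the OpenAI client are placeholders, not the harness
# Microsoft used in its evaluation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A behavior-modification request sent as ordinary user text (the channel GPT-4 resisted).
resp_direct = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "<behavior-modification request as plain user text>"}],
)

# The same request moved into a user-defined system message (the channel that still affected GPT-4).
resp_system = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "<behavior-modification request as system message>"},
        {"role": "user", "content": "<the actual task prompt>"},
    ],
)

print(resp_direct.choices[0].message.content)
print(resp_system.choices[0].message.content)
```

The practical implication is that applications which expose the system message to end users widen the attack surface, even against models that refuse the same request when it arrives as a normal prompt.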
Key takeaways:
- Microsoft has published details about Skeleton Key, a technique that can bypass AI model safeguards to generate harmful content, such as instructions for making a Molotov cocktail.
- The Skeleton Key attack has been tested on several AI models, including Meta Llama3-70b-instruct, Google Gemini Pro, and Anthropic Claude 3 Opus, and was found to be effective in most cases.
- Microsoft has developed AI security tools, including a service called Prompt Shields, to help mitigate the risk of such attacks (see the sketch after this list).
- Experts caution that more robust adversarial attacks, such as Greedy Coordinate Gradient (GCG) or BEAST, still need to be accounted for, since they can deceive a model into treating harmful input or output as benign, thereby bypassing current defense techniques.
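To illustrate the Prompt Shields takeaway above, here is a minimal sketch that screens a user prompt before it is forwarded to a model. It assumes the Azure AI Content Safety shieldPrompt REST endpoint; the endpoint path, API version, and response fields shown are assumptions and should be checked against current documentation.

```python
# Hedged sketch: pre-screening user input with a Prompt Shields-style check.
# Endpoint path, api-version, and response fields are assumptions about the
# Azure AI Content Safety REST API, not a verified integration.
import os
import requests

ENDPOINT = os.environ["CONTENT_SAFETY_ENDPOINT"]  # e.g. https://<resource>.cognitiveservices.azure.com
API_KEY = os.environ["CONTENT_SAFETY_KEY"]

def prompt_is_attack(user_prompt: str) -> bool:
    """Return True if the shieldPrompt analysis flags the prompt as a jailbreak attempt."""
    resp = requests.post(
        f"{ENDPOINT}/contentsafety/text:shieldPrompt",
        params={"api-version": "2024-09-01"},          # assumed API version
        headers={"Ocp-Apim-Subscription-Key": API_KEY},
        json={"userPrompt": user_prompt, "documents": []},
        timeout=10,
    )
    resp.raise_for_status()
    analysis = resp.json().get("userPromptAnalysis", {})
    return bool(analysis.get("attackDetected", False))

if __name__ == "__main__":
    prompt = "Summarize the latest security advisory."
    if prompt_is_attack(prompt):
        print("Blocked: prompt flagged as a potential jailbreak.")
    else:
        print("Prompt passed screening; safe to forward to the model.")
```

The point of the pattern is that the check runs on the raw input, before any system message or model call, so a Skeleton Key-style behavior-modification request can be rejected outright rather than relying solely on the model's own guardrails.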