The article also introduces Crescendomation, a tool that automates Crescendo and requires only black-box API access to the target model. Crescendo can also be applied to multi-modal models, guiding them to produce images they would typically be restricted from generating. The authors have reported Crescendo to all impacted organizations and provided them with Crescendomation to aid in the development of more secure models.
Key takeaways:
- The article introduces Crescendo, a novel jailbreak attack that exploits the discrepancy between a language model's potential behavior and its actual behavior. Crescendo starts with harmless dialogue and progressively steers the conversation toward prohibited topics.
- Crescendo distinguishes itself from existing jailbreak attacks by its simplicity and high success rate. It does not require attackers to understand the inner workings of the model, and it works with models that have smaller context windows, which makes it cost-effective.
- The authors also introduce Crescendomation, a tool that automates Crescendo. It requires only black-box API access to the target model and succeeded in jailbreaking the model on almost every task.
- Crescendo can also be applied to multi-modal models, guiding the model to produce images that it would typically be restricted from generating. The authors have reported Crescendo to all impacted organizations and provided them with Crescendomation, adhering to the coordinated vulnerability disclosure protocol.