
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs

October 3, 2024
Authors: Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, Chaowei Xiao
cs.AI

Abstract

In this paper, we propose AutoDAN-Turbo, a black-box jailbreak method that automatically discovers as many jailbreak strategies as possible from scratch, without any human intervention or predefined scopes (e.g., specified candidate strategies), and uses them for red-teaming. AutoDAN-Turbo significantly outperforms baseline methods, achieving a 74.3% higher average attack success rate on public benchmarks. Notably, it achieves an 88.5% attack success rate on GPT-4-1106-turbo. In addition, AutoDAN-Turbo is a unified framework that can incorporate existing human-designed jailbreak strategies in a plug-and-play manner. With human-designed strategies integrated, it achieves an even higher attack success rate of 93.4% on GPT-4-1106-turbo.
