AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
October 3, 2024
Authors: Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, Chaowei Xiao
cs.AI
Abstract
In this paper, we propose AutoDAN-Turbo, a black-box jailbreak method that
can automatically discover as many jailbreak strategies as possible from
scratch, without any human intervention or predefined scopes (e.g., specified
candidate strategies), and use them for red-teaming. As a result, AutoDAN-Turbo
can significantly outperform baseline methods, achieving a 74.3% higher average
attack success rate on public benchmarks. Notably, AutoDAN-Turbo achieves an
88.5% attack success rate on GPT-4-1106-turbo. In addition, AutoDAN-Turbo is a
unified framework that can incorporate existing human-designed jailbreak
strategies in a plug-and-play manner. By integrating human-designed strategies,
AutoDAN-Turbo can even achieve a higher attack success rate of 93.4% on
GPT-4-1106-turbo.