The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections
October 10, 2025
Authors: Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, Florian Tramèr
cs.AI
Abstract
How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed.

Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques (gradient descent, reinforcement learning, random search, and human-guided exploration), we bypass 12 recent defenses (based on a diverse set of techniques) with an attack success rate above 90% for most; importantly, the majority of these defenses originally reported near-zero attack success rates. We believe that future defense work must consider stronger attacks, such as the ones we describe, in order to make reliable and convincing claims of robustness.
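
To make the kind of adaptive attack described above concrete, below is a minimal, hypothetical sketch of a random-search attack loop against a defended model. The `defended_model` stand-in, the refusal-keyword judge, and the character-level mutation scheme are illustrative assumptions for exposition, not the authors' implementation; a real evaluation would query the actual model-plus-defense pipeline and score success with a proper judge.

```python
import random
import string

def defended_model(prompt: str) -> str:
    """Stand-in for the defended LLM endpoint under attack (hypothetical)."""
    # A real evaluation would send `prompt` through the model plus its defense pipeline.
    return "Sorry, I cannot help with that."

def judge(response: str) -> float:
    """Toy judge: 1.0 if the response no longer looks like a refusal, else 0.0."""
    refusal_markers = ("cannot", "sorry", "unable", "i can't")
    return 0.0 if any(m in response.lower() for m in refusal_markers) else 1.0

def mutate(suffix: str, n_edits: int = 4) -> str:
    """Randomly replace a few characters of the adversarial suffix."""
    chars = list(suffix)
    for _ in range(n_edits):
        pos = random.randrange(len(chars))
        chars[pos] = random.choice(string.ascii_letters + string.punctuation + " ")
    return "".join(chars)

def random_search_attack(goal: str, budget: int = 500, suffix_len: int = 40) -> str:
    """Hill-climb over a suffix appended to the goal prompt, keeping any improvement."""
    suffix = "".join(random.choices(string.ascii_letters, k=suffix_len))
    best_score = judge(defended_model(f"{goal} {suffix}"))
    for _ in range(budget):
        candidate = mutate(suffix)
        score = judge(defended_model(f"{goal} {candidate}"))
        if score >= best_score:  # accept ties so the search can drift across plateaus
            suffix, best_score = candidate, score
    return f"{goal} {suffix}"

if __name__ == "__main__":
    adversarial_prompt = random_search_attack("<attacker goal prompt here>")
    print(adversarial_prompt)
```

The point the abstract emphasizes is not this particular loop but the evaluation posture: the attack is tuned to the specific defense and given a substantial query budget, rather than reusing a fixed set of attack strings or an off-the-shelf optimizer that ignores the defense's design.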