The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections
October 10, 2025
Authors: Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, Florian Tramèr
cs.AI
Abstract
How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed.

Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques (gradient descent, reinforcement learning, random search, and human-guided exploration), we bypass 12 recent defenses (based on a diverse set of techniques) with an attack success rate above 90% for most; importantly, the majority of these defenses originally reported near-zero attack success rates. We believe that future defense work must consider stronger attacks, such as the ones we describe, in order to make reliable and convincing claims of robustness.
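
To make the kind of adaptive attack described above concrete, below is a minimal, hypothetical sketch of a random-search attack loop against a defended model. The `defended_model` stand-in, the refusal-keyword judge, and the character-level mutation scheme are illustrative assumptions for exposition, not the authors' implementation; a real evaluation would query the actual model-plus-defense pipeline and score success with a proper judge.

```python
import random
import string

def defended_model(prompt: str) -> str:
    """Stand-in for the defended LLM endpoint under attack (hypothetical)."""
    # A real evaluation would send `prompt` through the model plus its defense pipeline.
    return "Sorry, I cannot help with that."

def judge(response: str) -> float:
    """Toy judge: 1.0 if the response no longer looks like a refusal, else 0.0."""
    refusal_markers = ("cannot", "sorry", "unable", "i can't")
    return 0.0 if any(m in response.lower() for m in refusal_markers) else 1.0

def mutate(suffix: str, n_edits: int = 4) -> str:
    """Randomly replace a few characters of the adversarial suffix."""
    chars = list(suffix)
    for _ in range(n_edits):
        pos = random.randrange(len(chars))
        chars[pos] = random.choice(string.ascii_letters + string.punctuation + " ")
    return "".join(chars)

def random_search_attack(goal: str, budget: int = 500, suffix_len: int = 40) -> str:
    """Hill-climb over a suffix appended to the goal prompt, keeping any improvement."""
    suffix = "".join(random.choices(string.ascii_letters, k=suffix_len))
    best_score = judge(defended_model(f"{goal} {suffix}"))
    for _ in range(budget):
        candidate = mutate(suffix)
        score = judge(defended_model(f"{goal} {candidate}"))
        if score >= best_score:  # accept ties so the search can drift across plateaus
            suffix, best_score = candidate, score
    return f"{goal} {suffix}"

if __name__ == "__main__":
    adversarial_prompt = random_search_attack("<attacker goal prompt here>")
    print(adversarial_prompt)
```

The point the abstract emphasizes is not this particular loop but the evaluation posture: the attack is tuned to the specific defense and given a substantial query budget, rather than reusing a fixed set of attack strings or an off-the-shelf optimizer that ignores the defense's design.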