Universal Jailbreak Suffixes Are Strong Attention Hijackers

Abstract

We study suffix-based jailbreaks, a powerful family of attacks against large language models (LLMs) that optimize adversarial suffixes to circumvent safety alignment. Focusing on the widely used foundational GCG attack (Zou et al., 2023), we observe that suffixes vary in efficacy: some are markedly more universal than others, generalizing to many unseen harmful instructions. We first show that GCG's effectiveness is driven by a shallow, critical mechanism, built on the information flow from the adversarial suffix to the final chat template tokens before generation. Quantifying the dominance of this mechanism during generation, we find that GCG irregularly and aggressively hijacks the contextualization process. Crucially, we tie hijacking to the universality phenomenon, with more universal suffixes being stronger hijackers. Subsequently, we show that these insights have practical implications: GCG universality can be efficiently enhanced (up to ×5 in some cases) at no additional computational cost, and can also be surgically mitigated, at least halving attack success with minimal utility loss. We release our code and data at http://github.com/matanbt/interp-jailbreak.
