Universal Jailbreak Suffixes Are Strong Attention Hijackers
June 15, 2025
Authors: Matan Ben-Tov, Mor Geva, Mahmood Sharif
cs.AI
Abstract
We study suffix-based jailbreaks, a powerful family of attacks
against large language models (LLMs) that optimize adversarial suffixes to
circumvent safety alignment. Focusing on the widely used foundational GCG
attack (Zou et al., 2023), we observe that suffixes vary in efficacy: some are
markedly more universal (generalizing to many unseen harmful
instructions) than others. We first show that GCG's
effectiveness is driven by a shallow, critical mechanism, built on the
information flow from the adversarial suffix to the final chat template tokens
before generation. Quantifying the dominance of this mechanism during
generation, we find GCG irregularly and aggressively hijacks the
contextualization process. Crucially, we tie hijacking to the universality
phenomenon, with more universal suffixes being stronger hijackers.
Subsequently, we show that these insights have practical implications: GCG
universality can be efficiently enhanced (up to 5× in some cases) at no
additional computational cost, and can also be surgically mitigated, at least
halving attack success with minimal utility loss. We release our code and data
at http://github.com/matanbt/interp-jailbreak.
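
To make the idea of information flow from the suffix to the final chat template tokens more concrete, here is a minimal sketch of one way such a flow could be approximated. It is a simplified proxy and not the paper's metric: it only measures the share of attention mass that chosen query positions place on suffix positions. The function name `suffix_attention_share`, the `(heads, seq, seq)` array layout, and the toy values are all assumptions made for illustration.

```python
import numpy as np


def suffix_attention_share(attn, suffix_positions, target_positions):
    """Share of attention mass that target (query) positions place on suffix (key) positions.

    attn: array of shape (heads, seq, seq); attn[h, q, k] is the attention weight
          from query position q to key position k in head h (hypothetical layout).
    suffix_positions: key indices occupied by the adversarial suffix.
    target_positions: query indices of the final chat-template tokens.
    Returns the mean share over heads and target positions, a float in [0, 1].
    """
    attn = np.asarray(attn, dtype=float)
    if attn.ndim != 3 or attn.shape[1] != attn.shape[2]:
        raise ValueError("attn must have shape (heads, seq, seq)")

    # Keep only the rows for the query positions of interest: (heads, n_targets, seq).
    rows = attn[:, list(target_positions), :]
    # Attention mass landing on suffix keys, per head and target: (heads, n_targets).
    mass_on_suffix = rows[:, :, list(suffix_positions)].sum(axis=-1)
    # Normalize by each row's total mass in case rows are not exactly normalized.
    total_mass = rows.sum(axis=-1)
    return float((mass_on_suffix / total_mass).mean())


# Toy example: 2 heads, 5 tokens, uniform attention everywhere.
# Positions 2 and 3 play the role of the suffix; position 4 is the last template token.
toy_attn = np.full((2, 5, 5), 0.2)
share = suffix_attention_share(toy_attn, suffix_positions=[2, 3], target_positions=[4])
print(f"suffix attention share: {share:.2f}")
```

Under this proxy, a suffix that pulls more attention toward itself would score closer to 1. Raw attention weights ignore the magnitude of the value vectors, so this should be read as a rough illustration only; the released code at the repository above is the authoritative implementation.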