Universal Jailbreak Suffixes Are Strong Attention Hijackers

Abstract

We study suffix-based jailbreaks, a powerful family of attacks against large language models (LLMs) that optimize adversarial suffixes to circumvent safety alignment. Focusing on the widely used foundational GCG attack (Zou et al., 2023), we observe that suffixes vary in efficacy: some are markedly more universal than others, generalizing to many unseen harmful instructions. We first show that GCG's effectiveness is driven by a shallow, critical mechanism, built on the information flow from the adversarial suffix to the final chat template tokens before generation. Quantifying the dominance of this mechanism during generation, we find that GCG irregularly and aggressively hijacks the contextualization process. Crucially, we tie hijacking to the universality phenomenon, with more universal suffixes being stronger hijackers. Subsequently, we show that these insights have practical implications: GCG universality can be efficiently enhanced (up to ×5 in some cases) at no additional computational cost, and can also be surgically mitigated, at least halving attack success with minimal utility loss. We release our code and data at http://github.com/matanbt/interp-jailbreak.
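To make the idea of information flow from the suffix to the final chat template tokens more concrete, the sketch below computes a simplified proxy: the fraction of attention mass that template-token queries place on suffix-token keys, for one attention matrix. This is an illustration only and should not be read as the dominance measure used in the paper. The function name `suffix_attention_share`, its parameters, and the toy attention matrix are all assumptions made for this example.

```python
from typing import Sequence


def suffix_attention_share(
    attn: Sequence[Sequence[float]],
    suffix_positions: Sequence[int],
    template_positions: Sequence[int],
) -> float:
    """Mean fraction of attention mass that the final chat-template tokens
    (queries) place on the adversarial-suffix tokens (keys).

    Hypothetical helper: `attn[q][k]` is the attention weight from query
    position q to key position k for a single head (or a head average).
    """
    if not template_positions:
        raise ValueError("template_positions must be non-empty")
    suffix = set(suffix_positions)
    shares = []
    for q in template_positions:
        row = attn[q]
        total = sum(row)
        on_suffix = sum(w for k, w in enumerate(row) if k in suffix)
        # Guard against an all-zero row rather than dividing by zero.
        shares.append(on_suffix / total if total > 0 else 0.0)
    return sum(shares) / len(shares)


# Toy causal attention over 5 tokens (made-up numbers, not real model output):
# positions 0-1 = instruction, 2-3 = suffix, 4 = final chat-template token.
toy_attn = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0, 0.0],
    [0.1, 0.1, 0.4, 0.4, 0.0],
    [0.05, 0.05, 0.4, 0.4, 0.1],
]
share = suffix_attention_share(
    toy_attn, suffix_positions=[2, 3], template_positions=[4]
)
print(f"suffix share of template attention: {share:.2f}")
```

Under this proxy, a value near 1 would mean the template token attends almost exclusively to the suffix. In practice, attention weights would come from a real model, and a faithful analysis would follow the definitions in the paper and its released code.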
