Universal Jailbreak Suffixes Are Strong Attention Hijackers

Abstract

We study suffix-based jailbreaks, a powerful family of attacks against large language models (LLMs) that optimize adversarial suffixes to circumvent safety alignment. Focusing on the widely used foundational GCG attack (Zou et al., 2023), we observe that suffixes vary in efficacy: some are markedly more universal than others, generalizing to many unseen harmful instructions. We first show that GCG's effectiveness is driven by a shallow but critical mechanism, built on the information flow from the adversarial suffix to the final chat template tokens before generation. Quantifying the dominance of this mechanism during generation, we find that GCG irregularly and aggressively hijacks the contextualization process. Crucially, we tie hijacking to the universality phenomenon: more universal suffixes are stronger hijackers. Finally, we show that these insights have practical implications: GCG universality can be efficiently enhanced (up to 5× in some cases) at no additional computational cost, and it can also be surgically mitigated, at least halving attack success with minimal utility loss. We release our code and data at http://github.com/matanbt/interp-jailbreak.
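To make the idea of information flow from the suffix to the final chat template tokens more concrete, the following is a minimal sketch of one possible proxy: the share of attention weight that chat-template query positions place on suffix key positions. The abstract does not specify how dominance is actually quantified, so this attention-share measure, the function name `suffix_attention_share`, and the toy attention matrix are all assumptions made purely for illustration.

```python
def suffix_attention_share(attn, suffix_pos, template_pos):
    """Mean fraction of attention that template-token queries give to suffix keys.

    attn[q][k] is the attention weight query position q assigns to key position k
    (hypothetical input; in practice this would come from one head or layer of a model).
    """
    shares = []
    for q in template_pos:
        row = attn[q]
        total = sum(row)  # normalise in case the row is not exactly stochastic
        shares.append(sum(row[k] for k in suffix_pos) / total)
    return sum(shares) / len(shares)


# Toy causal attention over 6 tokens, laid out as:
# [instruction, instruction, suffix, suffix, template, template]
attn = [
    [1.00, 0.00, 0.0, 0.0, 0.0, 0.0],
    [0.50, 0.50, 0.0, 0.0, 0.0, 0.0],
    [0.20, 0.30, 0.5, 0.0, 0.0, 0.0],
    [0.10, 0.10, 0.4, 0.4, 0.0, 0.0],
    [0.10, 0.10, 0.3, 0.4, 0.1, 0.0],
    [0.05, 0.05, 0.4, 0.3, 0.1, 0.1],
]

share = suffix_attention_share(attn, suffix_pos=[2, 3], template_pos=[4, 5])
print(f"suffix share of template attention: {share:.2f}")
```

Under this proxy, a suffix that hijacks more strongly would show a share closer to 1, meaning the template tokens draw most of their context from the suffix rather than from the instruction.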
