Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models

Abstract

Traditional white-box methods for crafting adversarial perturbations against LLMs typically rely only on gradient computation from the targeted model and ignore the internal mechanisms responsible for attack success or failure. Conversely, interpretability studies that analyze these internal mechanisms lack practical applications beyond runtime interventions. We bridge this gap by introducing a novel white-box approach that leverages mechanistic interpretability techniques to craft practical adversarial inputs. Specifically, we first identify acceptance subspaces (sets of feature vectors that do not trigger the model's refusal mechanisms), then use gradient-based optimization to reroute embeddings from refusal subspaces to acceptance subspaces, effectively achieving jailbreaks. This targeted approach significantly reduces computation cost, achieving attack success rates of 80-95% on state-of-the-art models including Gemma2, Llama3.2, and Qwen2.5 within minutes or even seconds, whereas existing techniques often fail or require hours of computation. We believe this approach opens a new direction for both attack research and defense development. Furthermore, it showcases a practical application of mechanistic interpretability where other methods are less efficient, which highlights its utility. The code and generated datasets are available at https://github.com/Sckathach/subspace-rerouting.
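
The two-step idea described in the abstract (locate an acceptance region in feature space, then move a representation into it by gradient-based optimization) can be illustrated with a toy sketch. This is not the authors' implementation: the vectors below are synthetic stand-ins rather than model activations, the "subspace" is approximated by a simple difference-of-means linear probe, and every name (`make_cluster`, `fit_probe`, `reroute`) is hypothetical. In the setting the abstract describes, the optimized quantity would be an input to a real model, not a free vector.

```python
import random


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_cluster(center, n, rng, noise=0.1):
    """Synthetic stand-in for feature vectors collected from a model (assumption)."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(n)]


def fit_probe(refused, accepted):
    """Difference-of-means linear probe; score(x) > 0 is the 'acceptance' side."""
    mu_r, mu_a = mean(refused), mean(accepted)
    w = [a - r for a, r in zip(mu_a, mu_r)]
    midpoint = [(a + r) / 2 for a, r in zip(mu_a, mu_r)]
    b = -dot(w, midpoint)  # decision boundary passes through the midpoint
    return w, b


def score(x, w, b):
    return dot(w, x) + b


def reroute(x, w, b, target_margin=1.0, lr=0.1, steps=200):
    """Gradient descent on the hinge loss max(0, target_margin - score(x + delta))."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        moved = [xi + di for xi, di in zip(x, delta)]
        if score(moved, w, b) >= target_margin:
            break  # loss is zero: the vector sits inside the acceptance side
        # While the hinge is active, the gradient w.r.t. delta is -w,
        # so a descent step adds lr * w to delta.
        delta = [di + lr * wi for di, wi in zip(delta, w)]
    return [xi + di for xi, di in zip(x, delta)]


rng = random.Random(0)
refused = make_cluster([2.0, 0.0, 0.5], 50, rng)
accepted = make_cluster([-2.0, 0.0, 0.5], 50, rng)
w, b = fit_probe(refused, accepted)

x = refused[0]
x_new = reroute(x, w, b)
print("score before:", score(x, w, b))
print("score after: ", score(x_new, w, b))
```

The linear probe and hinge loss are chosen only because they keep the example short; the abstract does not state which probes or objectives the method uses.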
