ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Abstract

Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking, in which models that refuse a harmful request often comply when the same request is rephrased in the past tense, reveals a critical generalization gap in current alignment methods, whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), a mechanistically informed framework that surgically mitigates this specific vulnerability. First, we use circuit analysis to identify the specific attention heads causally linked to a targeted jailbreak such as the tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activations of the tense-vulnerable heads. Lastly, we apply it in a "preventative fine-tuning" stage, forcing the model to learn a more robust refusal mechanism. Across four LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility. Based on mechanistic analysis, our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
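As a rough illustration of the second step, the sketch below applies a channel-wise scaling vector to the outputs of selected attention heads and leaves every other head untouched. It is a toy under stated assumptions, not the authors' implementation: the function name `scale_heads`, the list-of-vectors layout, and all numeric values are hypothetical, and in the method described above the scaling vector would be learned rather than set by hand.

```python
from typing import Dict, List

Vector = List[float]


def scale_heads(head_outputs: List[Vector], scales: Dict[int, Vector]) -> List[Vector]:
    """Rescale the output channels of selected attention heads.

    head_outputs[h] is the output vector of head h at a single token position.
    scales maps a head index to one multiplier per channel of that head.
    Heads with no entry in `scales` pass through unchanged.
    """
    scaled: List[Vector] = []
    for head_index, output in enumerate(head_outputs):
        scale = scales.get(head_index)
        if scale is None:
            # Head was not flagged as vulnerable: copy it as is.
            scaled.append(list(output))
            continue
        if len(scale) != len(output):
            raise ValueError(f"scale length mismatch for head {head_index}")
        # Channel-wise (element-wise) multiplication.
        scaled.append([value * factor for value, factor in zip(output, scale)])
    return scaled


# Hypothetical toy values: two heads with three channels each.
# Only head 1 is treated as vulnerable and gets rescaled.
head_outputs = [[1.0, -2.0, 0.5], [2.0, 4.0, 6.0]]
scales = {1: [0.5, 1.0, 0.0]}
scaled = scale_heads(head_outputs, scales)
```

In an actual model, the same element-wise multiplication would presumably be applied inside the attention layer, restricted to the heads flagged by circuit analysis.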
