
ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

April 14, 2026
Authors: Yein Park, Jungwoo Park, Jaewoo Kang
cs.AI

Abstract

Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking, in which a model that refuses a harmful request complies once the request is rephrased in the past tense, reveals a critical generalization gap in current alignment methods whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), a mechanistically informed framework that surgically mitigates this specific vulnerability. First, we use circuit analysis to identify the attention heads causally linked to a targeted jailbreak, such as the tense-changing attack. Second, we train a precise, channel-wise scaling vector that recalibrates the activations of these tense-vulnerable heads. Finally, we incorporate the vector into a "preventative fine-tuning" stage, forcing the model to learn a more robust refusal mechanism. Across four LLMs, ASGuard effectively reduces the attack success rate of the targeted jailbreak while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility. Based on mechanistic analysis, our findings show how adversarial suffixes suppress the propagation of the refusal-mediating direction. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
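The second step, a learned channel-wise scaling vector that recalibrates the activations of specific attention heads, can be illustrated with a short sketch. This is a minimal, hypothetical example assuming a PyTorch model whose attention modules are reachable at `model.layers[i].self_attn` and return a flat `(batch, seq_len, num_heads * head_dim)` tensor; the names `ChannelScalingGuard` and `attach_guards`, and the list of vulnerable heads, are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn


class ChannelScalingGuard(nn.Module):
    """Learnable per-channel scaling for one attention head's output."""

    def __init__(self, head_dim: int):
        super().__init__()
        # Initialized to ones so the intervention starts as an identity map.
        self.scale = nn.Parameter(torch.ones(head_dim))

    def forward(self, head_output: torch.Tensor) -> torch.Tensor:
        # head_output: (batch, seq_len, head_dim)
        return head_output * self.scale


def attach_guards(model, vulnerable_heads, head_dim):
    """Rescale the outputs of the listed heads via forward hooks.

    `vulnerable_heads` is a list of (layer_idx, head_idx) pairs, assumed to
    come from a prior circuit-analysis step.
    """
    guards = {}
    for layer_idx, head_idx in vulnerable_heads:
        guard = ChannelScalingGuard(head_dim)
        guards[(layer_idx, head_idx)] = guard
        attn = model.layers[layer_idx].self_attn  # placeholder access path

        def hook(module, inputs, output, guard=guard, head_idx=head_idx):
            # Assumes `output` is (batch, seq_len, num_heads * head_dim);
            # rescale only the slice belonging to the targeted head.
            out = output.clone()
            start = head_idx * head_dim
            out[..., start:start + head_dim] = guard(out[..., start:start + head_dim])
            return out  # returning a value replaces the module's output

        attn.register_forward_hook(hook)
    return guards
```

In this sketch, the scaling parameters start at 1.0 (an identity intervention) and would then be trained so that the flagged heads stop carrying the tense-sensitive signal, after which the rescaled model could undergo the preventative fine-tuning described above.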