ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

April 14, 2026
作者: Yein Park, Jungwoo Park, Jaewoo Kang
cs.AI

Abstract

Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking demonstrates that models which refuse harmful requests often comply when those requests are rephrased in the past tense, revealing a critical generalization gap in current alignment methods, whose underlying mechanisms remain poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), a mechanistically informed framework that surgically mitigates this specific vulnerability. First, we use circuit analysis to identify the attention heads causally linked to the targeted jailbreak, such as a tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activations of the tense-vulnerable heads. Lastly, we apply this vector in a "preventative fine-tuning" stage, forcing the model to learn a more robust refusal mechanism. Across four LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility. Based on mechanistic analysis, our findings show that adversarial suffixes suppress the propagation of the refusal-mediating direction. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
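The channel-wise scaling step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, head dimension, and scaling values are illustrative assumptions. The idea is simply that a learned vector s rescales each channel of a vulnerable attention head's activation, h_guarded[i] = s[i] * h[i], so that channels implicated in the jailbreak circuit can be selectively damped.

```python
# Hypothetical sketch of channel-wise activation scaling
# (names, shapes, and values are illustrative, not from the paper).

def scale_head_activation(h, s):
    """Rescale one attention head's activation channel-by-channel."""
    assert len(h) == len(s), "scaling vector must match head dimension"
    return [hi * si for hi, si in zip(h, s)]

# Example: a 4-channel head activation at one token position.
h = [0.5, -1.2, 3.0, 0.7]   # raw head activation (illustrative)
s = [1.0, 1.0, 0.5, 1.0]    # near-identity; channel 2 is damped

h_guarded = scale_head_activation(h, s)
```

In the paper's pipeline, s would be trained (rather than hand-set as here) and then used during preventative fine-tuning; an identity vector (all ones) leaves the head's behavior unchanged, which makes the intervention surgical.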