ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

April 14, 2026
作者: Yein Park, Jungwoo Park, Jaewoo Kang
cs.AI

Abstract

Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking demonstrates that models which refuse harmful requests often comply when those requests are rephrased in the past tense, revealing a critical generalization gap in current alignment methods, whose underlying mechanisms remain poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), a mechanistically informed framework that surgically mitigates this specific vulnerability. First, we use circuit analysis to identify the attention heads causally linked to the targeted jailbreak, such as a tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activations of the tense-vulnerable heads. Lastly, we apply this vector in a "preventative fine-tuning" stage, forcing the model to learn a more robust refusal mechanism. Across four LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility. Based on mechanistic analysis, our findings show that adversarial suffixes suppress the propagation of the refusal-mediating direction. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
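The channel-wise scaling step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, head dimension, and scaling values are illustrative assumptions. The idea is simply that a learned vector s rescales each channel of a vulnerable attention head's activation, h_guarded[i] = s[i] * h[i], so that channels implicated in the jailbreak circuit can be selectively damped.

```python
# Hypothetical sketch of channel-wise activation scaling
# (names, shapes, and values are illustrative, not from the paper).

def scale_head_activation(h, s):
    """Rescale one attention head's activation channel-by-channel."""
    assert len(h) == len(s), "scaling vector must match head dimension"
    return [hi * si for hi, si in zip(h, s)]

# Example: a 4-channel head activation at one token position.
h = [0.5, -1.2, 3.0, 0.7]   # raw head activation (illustrative)
s = [1.0, 1.0, 0.5, 1.0]    # near-identity; channel 2 is damped

h_guarded = scale_head_activation(h, s)
```

In the paper's pipeline, s would be trained (rather than hand-set as here) and then used during preventative fine-tuning; an identity vector (all ones) leaves the head's behavior unchanged, which makes the intervention surgical.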