

Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection

August 28, 2025
Authors: Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah, Bernard Ghanem
cs.AI

Abstract

Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates - as evaluated by Llama Guard 3 - while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and ARC. Furthermore, we show that ROSI can also re-align 'uncensored' models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.
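The abstract describes two steps: estimating a safety direction from harmful/harmless instruction pairs, and applying a rank-one update to residual-stream write matrices that steers activations toward that direction. A minimal numpy sketch of this idea follows; the difference-in-means estimator, the update rule `W' = (I + alpha * r r^T) W`, and the scale `alpha` are illustrative assumptions, not the paper's published formulas.

```python
# Toy sketch of rank-one safety injection (assumed form, not the paper's exact method).
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy residual-stream width

# 1) Estimate a "safety direction" r as the difference of mean activations
#    over harmful vs. harmless prompts (difference-in-means), then normalize.
harmful_acts = rng.normal(size=(16, d)) + 2.0   # synthetic stand-in activations
harmless_acts = rng.normal(size=(16, d))
r = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
r = r / np.linalg.norm(r)                        # unit safety direction

# 2) Rank-one update to a residual-stream write matrix W (output = W @ x):
#    W' = W + alpha * outer(r, W.T @ r) = (I + alpha * r r^T) @ W,
#    which amplifies the component of every write along r.
alpha = 0.5
W = rng.normal(size=(d, d))                      # stand-in write matrix
W_new = W + alpha * np.outer(r, W.T @ r)

# The modification is rank one, and for any input x the output's projection
# onto r grows by exactly a factor of (1 + alpha).
x = rng.normal(size=d)
proj_old = r @ (W @ x)
proj_new = r @ (W_new @ x)
```

This mirrors how the directional-ablation attacks the paper mentions remove a component `(I - r r^T) W`; ROSI's amplification is the inverse move, which is why it needs no fine-tuning and can be applied to every write matrix at once.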