Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection
August 28, 2025
作者: Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah, Bernard Ghanem
cs.AI
Abstract
Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates, as evaluated by Llama Guard 3, while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and ARC. Furthermore, we show that ROSI can also re-align 'uncensored' models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.
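
To make the two steps in the abstract concrete, the sketch below illustrates (a) extracting a safety direction as the normalized difference in mean residual-stream activations between harmful and harmless instructions, and (b) applying a rank-one update to a residual-stream write matrix so that the component of its output along that direction is amplified. The specific update rule W + α·r̂r̂ᵀW (the sign-flipped counterpart of directional ablation), the scaling factor `alpha`, and the synthetic activations are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of rank-one safety injection on a single write matrix.
# Assumptions: the update rule, layer choice, and `alpha` are illustrative.
import torch


def safety_direction(harmful_acts: torch.Tensor,
                     harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-in-means direction between residual-stream activations
    cached on harmful vs. harmless instructions, normalized to unit length."""
    r = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return r / r.norm()


def rank_one_inject(W_out: torch.Tensor, r_hat: torch.Tensor,
                    alpha: float = 0.1) -> torch.Tensor:
    """Rank-one update of a residual-stream write matrix W_out (d_model x d_in):
    boosts the component of every write along r_hat by a factor of (1 + alpha).
    This is the opposite of rank-one directional ablation, W <- (I - r r^T) W."""
    return W_out + alpha * torch.outer(r_hat, r_hat) @ W_out


# Toy usage with synthetic activations standing in for cached model activations.
torch.manual_seed(0)
d_model, d_in = 8, 8
harmful = torch.randn(32, d_model) + 1.0    # activations on harmful prompts
harmless = torch.randn(32, d_model)         # activations on harmless prompts
r_hat = safety_direction(harmful, harmless)

W = torch.randn(d_model, d_in)              # stand-in for an output projection
W_new = rank_one_inject(W, r_hat, alpha=0.2)

x = torch.randn(d_in)
print("projection onto r_hat before:", ((W @ x) @ r_hat).item())
print("projection onto r_hat after: ", ((W_new @ x) @ r_hat).item())
```

In this toy example the output's projection onto r̂ grows by exactly the factor (1 + alpha), while components orthogonal to r̂ are untouched; applying the same update to every residual-stream write matrix of a real model is what the abstract describes as a permanent, fine-tuning-free steer toward the refusal-mediating subspace.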