Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection
August 28, 2025
作者: Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, George Turkiyyah, Bernard Ghanem
cs.AI
Abstract
Safety alignment in Large Language Models (LLMs) often involves mediating internal representations to refuse harmful requests. Recent research has demonstrated that these safety mechanisms can be bypassed by ablating or removing specific representational directions within the model. In this paper, we propose the opposite approach: Rank-One Safety Injection (ROSI), a white-box method that amplifies a model's safety alignment by permanently steering its activations toward the refusal-mediating subspace. ROSI operates as a simple, fine-tuning-free rank-one weight modification applied to all residual stream write matrices. The required safety direction can be computed from a small set of harmful and harmless instruction pairs. We show that ROSI consistently increases safety refusal rates, as evaluated by Llama Guard 3, while preserving the utility of the model on standard benchmarks such as MMLU, HellaSwag, and ARC. Furthermore, we show that ROSI can also re-align 'uncensored' models by amplifying their own latent safety directions, demonstrating its utility as an effective last-mile safety procedure. Our results suggest that targeted, interpretable weight steering is a cheap and potent mechanism to improve LLM safety, complementing more resource-intensive fine-tuning paradigms.
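
To make the two steps in the abstract concrete, the sketch below illustrates (a) extracting a safety direction as the normalized difference in mean residual-stream activations between harmful and harmless instructions, and (b) applying a rank-one update to a residual-stream write matrix so that the component of its output along that direction is amplified. The specific update rule W + α·r̂r̂ᵀW (the sign-flipped counterpart of directional ablation), the scaling factor `alpha`, and the synthetic activations are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of rank-one safety injection on a single write matrix.
# Assumptions: the update rule, layer choice, and `alpha` are illustrative.
import torch


def safety_direction(harmful_acts: torch.Tensor,
                     harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-in-means direction between residual-stream activations
    cached on harmful vs. harmless instructions, normalized to unit length."""
    r = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return r / r.norm()


def rank_one_inject(W_out: torch.Tensor, r_hat: torch.Tensor,
                    alpha: float = 0.1) -> torch.Tensor:
    """Rank-one update of a residual-stream write matrix W_out (d_model x d_in):
    boosts the component of every write along r_hat by a factor of (1 + alpha).
    This is the opposite of rank-one directional ablation, W <- (I - r r^T) W."""
    return W_out + alpha * torch.outer(r_hat, r_hat) @ W_out


# Toy usage with synthetic activations standing in for cached model activations.
torch.manual_seed(0)
d_model, d_in = 8, 8
harmful = torch.randn(32, d_model) + 1.0    # activations on harmful prompts
harmless = torch.randn(32, d_model)         # activations on harmless prompts
r_hat = safety_direction(harmful, harmless)

W = torch.randn(d_model, d_in)              # stand-in for an output projection
W_new = rank_one_inject(W, r_hat, alpha=0.2)

x = torch.randn(d_in)
print("projection onto r_hat before:", ((W @ x) @ r_hat).item())
print("projection onto r_hat after: ", ((W_new @ x) @ r_hat).item())
```

In this toy example the output's projection onto r̂ grows by exactly the factor (1 + alpha), while components orthogonal to r̂ are untouched; applying the same update to every residual-stream write matrix of a real model is what the abstract describes as a permanent, fine-tuning-free steer toward the refusal-mediating subspace.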