Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection
May 12, 2026
Authors: Guanglong Sun, Siyuan Zhang, Liyuan Wang, Jun Zhu, Hang Su, Yi Zhong
cs.AI
Abstract
Safety post-training can improve the harmlessness and policy compliance of Large Language Models (LLMs), but it may also reduce general utility, a phenomenon often described as the alignment tax. We study this trade-off through the lens of continual learning: sequential alignment stages expose the model to shifted data distributions and objectives, and their gradients may interfere with directions that support previously acquired general capabilities. This view does not claim that all alignment degradation has a single cause; rather, it provides a useful first-order mechanism for mitigating one important source of capability regression. We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight update rule that estimates a low-rank reference subspace from gradients on a small set of general-capability data and removes from each safety gradient the component lying in this subspace. The resulting update is the steepest local safety-descent direction subject to first-order preservation constraints on the reference objectives. OGPSA is compatible with standard post-training pipelines and avoids large-scale replay, although it introduces periodic reference-gradient computation. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA improves the observed safety-utility trade-off over standard baselines. Under the sequential SFT→DPO pipeline, the average performance gain increases from 33.98% to 42.74% on Qwen2.5-7B-Instruct and from 19.74% to 32.98% on Llama3.1-8B-Instruct. We have open-sourced our code at https://github.com/SunGL001/OGPSA.
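The projection step described in the abstract can be illustrated with a small, dependency-free sketch. It is not taken from the OGPSA repository: the function names (`orthonormal_basis`, `project_out`), the toy 3-dimensional gradients, and the use of Gram-Schmidt to build the reference subspace are all assumptions made for illustration, and the paper's actual low-rank estimator may differ. Given an orthonormal basis U of the reference subspace, the sketch computes (I - U Uᵀ) g for a flattened safety gradient g, so the result has zero inner product with every reference gradient.

```python
# Illustrative sketch only: all names and the Gram-Schmidt subspace estimate
# are assumptions, not the authors' implementation.
from math import sqrt


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(reference_grads, tol=1e-10):
    """Gram-Schmidt over flattened reference gradients.

    Gradients that are (nearly) linearly dependent on earlier ones are
    dropped, so the basis spans the same subspace as the inputs.
    """
    basis = []
    for g in reference_grads:
        r = list(g)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = sqrt(dot(r, r))
        if norm > tol:
            basis.append([ri / norm for ri in r])
    return basis


def project_out(safety_grad, basis):
    """Remove from safety_grad its component lying in span(basis)."""
    g = list(safety_grad)
    for b in basis:
        c = dot(g, b)
        g = [gi - c * bi for gi, bi in zip(g, b)]
    return g


# Toy example: two hypothetical general-capability gradients in 3 dimensions.
reference_grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
safety_grad = [0.5, -2.0, 3.0]

basis = orthonormal_basis(reference_grads)
projected = project_out(safety_grad, basis)
```

In this toy case the reference gradients span the first two coordinates, so only the third coordinate of the safety gradient survives. A step along `-projected` leaves the reference losses unchanged to first order while still having a positive inner product with the original safety gradient, which is the sense in which the abstract calls the update a constrained steepest-descent direction.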