Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

Abstract

Safety post-training can reduce the harmfulness and improve the policy compliance of Large Language Models (LLMs), but it may also reduce general utility, a phenomenon often described as the alignment tax. We study this trade-off through the lens of continual learning: sequential alignment stages expose the model to shifted data distributions and objectives, and their gradients may interfere with directions that support previously acquired general capabilities. This view does not claim that all alignment degradation has a single cause; rather, it provides a useful first-order mechanism for mitigating one important source of capability regression. We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight update rule that estimates a low-rank reference subspace from gradients on a small set of general-capability data and removes from each safety gradient the component lying in this subspace. The resulting update is the steepest local safety-descent direction subject to first-order preservation constraints on the reference objectives. OGPSA is compatible with standard post-training pipelines and avoids large-scale replay, although it introduces periodic reference-gradient computation. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA improves the observed safety-utility trade-off over standard baselines. Under the sequential SFT→DPO pipeline, the average performance gain increases from 33.98% to 42.74% on Qwen2.5-7B-Instruct and from 19.74% to 32.98% on Llama3.1-8B-Instruct. We have open-sourced our code at https://github.com/SunGL001/OGPSA.
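The projection step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the released OGPSA implementation: the function names `reference_basis` and `project_out`, the use of a truncated SVD to obtain the low-rank subspace, the flattened-gradient representation, and the toy dimensions are all assumptions made for illustration. The sketch builds an orthonormal basis from a few reference gradients and removes from a safety gradient the component lying in that subspace.

```python
import numpy as np


def reference_basis(ref_grads: np.ndarray, rank: int) -> np.ndarray:
    """Orthonormal basis (d x r) for the leading directions of the reference gradients.

    ref_grads has shape (k, d): one flattened gradient per reference batch.
    Using a truncated SVD here is an illustrative assumption.
    """
    # The right singular vectors span the row space of the gradient matrix.
    _, s, vt = np.linalg.svd(ref_grads, full_matrices=False)
    # Drop numerically null directions, then keep at most `rank` of the rest.
    keep = min(rank, int(np.sum(s > 1e-10 * s[0])))
    return vt[:keep].T


def project_out(safety_grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove from safety_grad its component inside span(basis)."""
    return safety_grad - basis @ (basis.T @ safety_grad)


# Toy example with hypothetical sizes: 3 reference gradients in 8 dimensions.
rng = np.random.default_rng(0)
ref_grads = rng.normal(size=(3, 8))
safety_grad = rng.normal(size=8)

basis = reference_basis(ref_grads, rank=3)
update = project_out(safety_grad, basis)

# To first order, a step along `update` leaves each reference loss unchanged,
# because `update` is orthogonal to every reference gradient in the subspace.
print(np.abs(ref_grads @ update).max())
```

With `rank` equal to the number of linearly independent reference gradients, the projected update is orthogonal to all of them. With a smaller `rank`, only the retained leading directions are protected, so preservation of the reference objectives is approximate. Up to scale, the projected vector is also the direction that maximizes alignment with the safety gradient among all directions orthogonal to the basis, which is the sense in which the update is steepest under the first-order constraints.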