Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

Abstract

Safety post-training can reduce the harmfulness and improve the policy compliance of Large Language Models (LLMs), but it may also reduce general utility, a phenomenon often described as the alignment tax. We study this trade-off through the lens of continual learning: sequential alignment stages expose the model to shifted data distributions and objectives, and their gradients may interfere with directions that support previously acquired general capabilities. This view does not claim that all alignment degradation has a single cause; rather, it provides a useful first-order mechanism for mitigating one important source of capability regression. We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight update rule that estimates a low-rank reference subspace from gradients on a small set of general-capability data and removes from each safety gradient the component lying in this subspace. The resulting update is the steepest local safety-descent direction subject to first-order preservation constraints on the reference objectives. OGPSA is compatible with standard post-training pipelines and avoids large-scale replay, although it introduces periodic reference-gradient computation. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA improves the observed safety-utility trade-off over standard baselines. Under the sequential SFT→DPO pipeline, the average performance gain increases from 33.98% to 42.74% on Qwen2.5-7B-Instruct and from 19.74% to 32.98% on Llama3.1-8B-Instruct. Our code is open source at https://github.com/SunGL001/OGPSA.
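The projection step described in the abstract can be illustrated with a small, dependency-free sketch. It is not taken from the OGPSA repository. The function names (`dot`, `orthonormal_basis`, `project_out`), the `max_rank` and `tol` parameters, and the toy three-dimensional gradients are all hypothetical. The abstract does not say how the low-rank reference subspace is estimated, so this sketch assumes a plain Gram-Schmidt pass over a few reference gradients as a stand-in. Given an orthonormal basis U of that subspace, the projected safety gradient is g - U U^T g, which has zero inner product with every basis direction.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(reference_grads, max_rank, tol=1e-10):
    """Build an orthonormal basis from reference gradients via Gram-Schmidt.

    Assumption: this stands in for the low-rank subspace estimate; the
    source does not specify the estimator. At most max_rank directions
    are kept, and near-zero residuals (norm <= tol) are skipped.
    """
    basis = []
    for grad in reference_grads:
        if len(basis) == max_rank:
            break
        residual = list(grad)
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * bi for r, bi in zip(residual, b)]
        norm = math.sqrt(dot(residual, residual))
        if norm > tol:
            basis.append([r / norm for r in residual])
    return basis


def project_out(safety_grad, basis):
    """Remove from safety_grad its component inside span(basis)."""
    projected = list(safety_grad)
    for b in basis:
        coeff = dot(projected, b)
        projected = [p - coeff * bi for p, bi in zip(projected, b)]
    return projected


# Toy data (hypothetical): two general-capability gradients, one safety gradient.
reference_grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
safety_grad = [2.0, -3.0, 4.0]

basis = orthonormal_basis(reference_grads, max_rank=2)
update = project_out(safety_grad, basis)

# To first order, a step along `update` leaves the reference objectives
# unchanged, because it is orthogonal to each reference gradient.
overlaps = [dot(update, g) for g in reference_grads]
```

In this toy case the reference gradients span the first two coordinates, so only the third coordinate of the safety gradient survives the projection. In an actual training loop the vectors would presumably be flattened parameter gradients and the basis would be refreshed periodically, which corresponds to the periodic reference-gradient computation the abstract mentions.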