Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

Abstract

Safety post-training can reduce the harmfulness and improve the policy compliance of large language models (LLMs), but it may also reduce general utility, a phenomenon often described as the alignment tax. We study this trade-off through the lens of continual learning: sequential alignment stages expose the model to shifted data distributions and objectives, and their gradients may interfere with directions that support previously acquired general capabilities. This view does not claim that all alignment degradation has a single cause; rather, it provides a useful first-order mechanism for mitigating one important source of capability regression. We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight update rule that estimates a low-rank reference subspace from gradients on a small set of general-capability data and removes from each safety gradient the component lying in this subspace. The resulting update is the steepest local safety-descent direction subject to first-order preservation constraints on the reference objectives. OGPSA is compatible with standard post-training pipelines and avoids large-scale replay, although it introduces periodic reference-gradient computation. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA improves the observed safety-utility trade-off over standard baselines. Under the sequential SFT→DPO pipeline, the average performance gain increases from 33.98% to 42.74% on Qwen2.5-7B-Instruct and from 19.74% to 32.98% on Llama3.1-8B-Instruct. We have open-sourced our code at https://github.com/SunGL001/OGPSA.
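
The projection step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the released OGPSA implementation. It assumes gradients are flattened into vectors, that the reference subspace is taken from the top right singular vectors of a stacked reference-gradient matrix, and that the subspace rank equals the number of reference gradients so the first-order constraint holds exactly. The names `reference_basis` and `project_out`, and the toy dimensions, are hypothetical.

```python
import numpy as np


def reference_basis(ref_grads: np.ndarray, rank: int) -> np.ndarray:
    """Return a (dim, rank) orthonormal basis spanning the top-`rank`
    directions of the stacked reference gradients (shape: n_ref x dim).

    Assumption: the subspace is estimated with a plain SVD; the paper's
    actual estimator may differ.
    """
    _, _, vt = np.linalg.svd(ref_grads, full_matrices=False)
    return vt[:rank].T


def project_out(safety_grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove from the safety gradient its component inside span(basis)."""
    return safety_grad - basis @ (basis.T @ safety_grad)


# Toy data standing in for flattened model gradients (hypothetical sizes).
rng = np.random.default_rng(0)
dim, n_ref, rank = 64, 4, 4
ref_grads = rng.normal(size=(n_ref, dim))   # general-capability gradients
safety_grad = rng.normal(size=dim)          # one safety gradient

basis = reference_basis(ref_grads, rank)
update = project_out(safety_grad, basis)

# First-order preservation: the update has no overlap with any reference
# gradient, so a small step leaves the reference losses unchanged to
# first order.
max_overlap = float(np.max(np.abs(ref_grads @ update)))

# The update still correlates positively with the safety gradient, so
# stepping along -update remains a descent direction for the safety loss.
descent_alignment = float(safety_grad @ update)
```

When `rank` is smaller than the number of reference gradients, the overlap with individual reference gradients is only approximately zero; that is the trade-off implied by a low-rank subspace, which costs less to store and apply but preserves the reference objectives less exactly.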