Reward-free Alignment for Conflicting Objectives
February 2, 2026
Authors: Peter Chen, Xiaopeng Li, Xi Chen, Tianyi Lin
cs.AI
Abstract
Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted-loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicting Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping strictly improves the convergence rate in the two-objective setting. Second, we augment our method with practical heuristics and conduct experiments demonstrating that the proposed framework is well suited to LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs than existing multi-objective alignment baselines.
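To make the core idea concrete, the sketch below illustrates one way a clipped conflict-averse update for two objectives might look. The abstract does not specify RACO's exact update rule, so this is a minimal surrogate under stated assumptions: conflicting gradients are deconflicted by projecting each onto the normal plane of the other (a PCGrad-style rule, which may differ from the paper's), and the weighted combination is then norm-clipped. The function name, signature, and clipping threshold are all hypothetical.

```python
import numpy as np

def clipped_conflict_averse_update(g1, g2, w=(0.5, 0.5), clip_norm=1.0):
    """Hypothetical sketch: combine two objective gradients into a single
    update direction that avoids gradient conflict, then clip its norm.

    g1, g2    -- per-objective gradients (1-D arrays)
    w         -- user-specified objective weights
    clip_norm -- maximum norm of the combined update (assumed heuristic)
    """
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)

    if g1 @ g2 < 0:
        # Gradients conflict: project each onto the normal plane of the
        # other so the combined step does not degrade either objective
        # (PCGrad-style surrogate; RACO's actual rule may differ).
        g1p = g1 - (g1 @ g2) / (g2 @ g2) * g2
        g2p = g2 - (g2 @ g1) / (g1 @ g1) * g1
    else:
        g1p, g2p = g1, g2

    # Weighted combination respecting the user-specified weights.
    d = w[0] * g1p + w[1] * g2p

    # Clipping step: bound the update norm.
    n = np.linalg.norm(d)
    if n > clip_norm:
        d = d * (clip_norm / n)
    return d
```

For conflicting inputs such as `g1 = [1, 0]` and `g2 = [-1, 1]`, the returned direction has a non-negative inner product with both original gradients, i.e. neither objective is worsened to first order.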