

Reward-free Alignment for Conflicting Objectives

February 2, 2026
作者: Peter Chen, Xiaopeng Li, Xi Chen, Tianyi Lin
cs.AI

Abstract

Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted-loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicting Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve the convergence rate in the two-objective setting. Second, we augment our method with practical heuristics and conduct experiments demonstrating the compatibility of the proposed framework with LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs than existing multi-objective alignment baselines.
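The abstract only names the core update rule, so as a rough illustration, here is a minimal NumPy sketch of what a clipped conflict-averse gradient step could look like in the two-objective case. It follows the CAGrad formulation of Liu et al. (2021), which the abstract's "conflict-averse gradient descent" refers to, with a simple norm clip added on top; the function name, the grid-search dual solver, and the `clip_norm` rule are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of a clipped conflict-averse update for two objectives.
# NOTE: the paper's exact algorithm is not given in the abstract; this follows
# the CAGrad objective (Liu et al., 2021) with an assumed norm-clipping rule.
import numpy as np

def clipped_conflict_averse_update(g1, g2, weights=(0.5, 0.5), c=0.4, clip_norm=1.0):
    """Combine two (possibly conflicting) objective gradients into one update.

    g1, g2    : flattened gradients of the two objectives (np.ndarray)
    weights   : user-specified objective weights (assumed interface)
    c         : conflict-aversion radius, as in CAGrad
    clip_norm : cap on the final update norm (illustrative clipping rule)
    """
    g0 = weights[0] * g1 + weights[1] * g2  # weighted average direction
    # Solve the 1-D dual of  max_d min_i <g_i, d>  s.t. ||d - g0|| <= c||g0||
    # by a coarse grid search over convex combinations g_w = w*g1 + (1-w)*g2.
    best_w, best_val = 0.5, np.inf
    for w in np.linspace(0.0, 1.0, 101):
        gw = w * g1 + (1 - w) * g2
        val = gw @ g0 + c * np.linalg.norm(g0) * np.linalg.norm(gw)
        if val < best_val:
            best_val, best_w = val, w
    gw = best_w * g1 + (1 - best_w) * g2
    d = g0 + (c * np.linalg.norm(g0) / (np.linalg.norm(gw) + 1e-12)) * gw
    # Clip the combined update so one dominant gradient cannot blow up the step.
    norm = np.linalg.norm(d)
    if norm > clip_norm:
        d = d * (clip_norm / norm)
    return d

# Example: two conflicting gradients (negative inner product).
g1 = np.array([1.0, 0.5])
g2 = np.array([-0.8, 0.9])
print(clipped_conflict_averse_update(g1, g2))
```

The key property of such an update is that, when the per-objective gradients conflict, the returned direction still makes non-negative progress on the worst-off objective within the trust region around the weighted average; the added clip bounds the step size, which is plausibly where the abstract's claimed convergence-rate improvement in the two-objective setting comes from.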