Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting

September 14, 2025
Authors: Yining Lu, Zilong Wang, Shiyang Li, Xin Liu, Changlong Yu, Qingyu Yin, Zhan Shi, Zixuan Zhang, Meng Jiang
cs.AI

Abstract

Prior works in multi-objective reinforcement learning typically use linear reward scalarization with fixed weights, which provably fails to capture non-convex Pareto fronts and thus yields suboptimal results. This limitation becomes especially critical in online preference alignment for large language models. Here, stochastic trajectories generated by parameterized policies create highly non-linear and non-convex mappings from parameters to objectives, for which no single static weighting scheme can find optimal trade-offs. We address this limitation by introducing dynamic reward weighting, which adaptively adjusts reward weights during the online reinforcement learning process. Unlike existing approaches that rely on fixed-weight interpolation, our dynamic weighting continuously balances and prioritizes objectives during training, facilitating effective exploration of Pareto fronts in objective space. We introduce two approaches of increasing sophistication and generalizability: (1) hypervolume-guided weight adaptation and (2) gradient-based weight optimization, offering a versatile toolkit for online multi-objective alignment. Our extensive experiments demonstrate their compatibility with commonly used online reinforcement learning algorithms (including GRPO, REINFORCE, and RLOO), effectiveness across multiple mathematical reasoning datasets, and applicability to different model families, consistently achieving Pareto-dominant solutions with fewer training steps than fixed-weight linear scalarization baselines.
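To make the idea concrete, the sketch below illustrates one plausible form of hypervolume-guided weight adaptation for two reward objectives. It is an illustration of the general mechanism named in the abstract, not the paper's actual algorithm: the front representation, the perturbation size `eps`, the softmax temperature `tau`, and the function names are assumptions made for this example.

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Dominated area (hypervolume) of a 2-objective maximization front."""
    pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                 key=lambda p: -p[0])              # sweep from largest first objective
    hv, max_y = 0.0, ref[1]
    for x, y in pts:
        if y > max_y:                              # only non-dominated slabs add area
            hv += (x - ref[0]) * (y - max_y)
            max_y = y
    return hv

def update_weights(front, ref, eps=0.05, tau=0.1):
    """Shift reward weights toward the objective whose improvement would
    grow the hypervolume of the current front the most (hypothetical rule)."""
    front = np.asarray(front, dtype=float)
    base = hypervolume_2d(front, ref)
    gains = []
    for k in range(front.shape[1]):                # marginal hypervolume gain per objective
        bumped = front.copy()
        bumped[:, k] += eps
        gains.append(hypervolume_2d(bumped, ref) - base)
    gains = np.asarray(gains)
    logits = (gains - gains.max()) / tau           # stabilized softmax over the gains
    w = np.exp(logits)
    return w / w.sum()

# Toy usage: a "front" of per-objective mean rewards from recent policy
# checkpoints (two objectives, e.g., correctness and brevity). Weights are
# recomputed before the next online RL update and used to linearly
# scalarize the per-objective rewards: r = w1*r1 + w2*r2.
front = [(0.62, 0.30), (0.55, 0.41), (0.48, 0.47)]
weights = update_weights(front, ref=(0.0, 0.0))
scalar_reward = float(np.dot(weights, [0.7, 0.2]))
print(weights, scalar_reward)
```

In this sketch the weights recomputed at each step feed into the same linear scalarization used by standard online RL algorithms (e.g., GRPO, REINFORCE, or RLOO), so the effective trade-off between objectives shifts as training progresses rather than staying fixed.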