Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting
September 14, 2025
Authors: Yining Lu, Zilong Wang, Shiyang Li, Xin Liu, Changlong Yu, Qingyu Yin, Zhan Shi, Zixuan Zhang, Meng Jiang
cs.AI
Abstract
Prior works in multi-objective reinforcement learning typically use linear
reward scalarization with fixed weights, which provably fail to capture
non-convex Pareto fronts and thus yield suboptimal results. This limitation
becomes especially critical in online preference alignment for large language
models. Here, stochastic trajectories generated by parameterized policies
create highly non-linear and non-convex mappings from parameters to objectives,
for which no single static weighting scheme can find optimal trade-offs. We address
this limitation by introducing dynamic reward weighting, which adaptively
adjusts reward weights during the online reinforcement learning process. Unlike
existing approaches that rely on fixed-weight interpolation, our dynamic
weighting continuously balances and prioritizes objectives in training,
facilitating effective exploration of Pareto fronts in objective space. We
introduce two approaches of increasing sophistication and generalizability: (1)
hypervolume-guided weight adaptation and (2) gradient-based weight
optimization, offering a versatile toolkit for online multi-objective
alignment. Our extensive experiments demonstrate their compatibility with
commonly used online reinforcement learning algorithms (including GRPO,
REINFORCE, and RLOO), effectiveness across multiple mathematical reasoning
datasets, and applicability to different model families, consistently achieving
Pareto dominant solutions with fewer training steps than fixed-weight linear
scalarization baselines.
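
The dynamic-weighting idea can be made concrete with a short sketch. The snippet below is a minimal, illustrative reading of hypervolume-guided weight adaptation for two objectives, not the authors' implementation: the names `hypervolume_2d`, `adapt_weights`, `ref_point`, and `probe_step`, and the specific update rule (weight each objective by the hypervolume gained from a small probe improvement along it), are assumptions for illustration only.

```python
"""Minimal sketch: dynamic reward weighting guided by 2-D hypervolume.

Assumed setup (not from the paper): two maximized objectives per sampled
trajectory (e.g., correctness and brevity scores), a fixed reference point,
and a softmax over per-objective hypervolume gains to produce weights for
the scalarized reward fed to the online RL algorithm (GRPO, REINFORCE, RLOO).
"""
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (objective_1, objective_2), both maximized


def hypervolume_2d(points: List[Point], ref: Point) -> float:
    """Area dominated by `points` relative to reference point `ref` (maximization)."""
    pts = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    if not pts:
        return 0.0
    # Sweep in decreasing objective-1, adding a rectangle whenever objective-2 improves.
    pts.sort(key=lambda p: p[0], reverse=True)
    hv, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            hv += (x - ref[0]) * (y - best_y)
            best_y = y
    return hv


def adapt_weights(points: List[Point], ref: Point,
                  probe_step: float = 0.05, temperature: float = 0.05) -> List[float]:
    """Assign each objective a weight proportional (via softmax) to the hypervolume
    gained by nudging every current point by `probe_step` along that objective."""
    base_hv = hypervolume_2d(points, ref)
    gains = []
    for k in range(2):
        probed = [tuple(v + probe_step if i == k else v for i, v in enumerate(p))
                  for p in points]
        gains.append(hypervolume_2d(points + probed, ref) - base_hv)
    exps = [math.exp(g / temperature) for g in gains]
    z = sum(exps)
    return [e / z for e in exps]


def scalarized_reward(rewards: Point, weights: List[float]) -> float:
    """Linear scalarization r = sum_k w_k * r_k, but with time-varying weights."""
    return sum(w * r for w, r in zip(weights, rewards))


if __name__ == "__main__":
    # Objective points from the current batch of sampled trajectories (illustrative values).
    batch_points = [(0.62, 0.30), (0.55, 0.41), (0.70, 0.22)]
    ref_point = (0.0, 0.0)
    weights = adapt_weights(batch_points, ref_point)
    print("adapted weights:", weights)
    print("scalarized reward for one trajectory:", scalarized_reward((0.62, 0.30), weights))
```

In this reading, the weights are recomputed every few training steps from the objective values of freshly sampled trajectories, so objectives whose improvement would expand the current Pareto front the most are temporarily prioritized; the gradient-based variant described in the abstract would replace `adapt_weights` with an update driven by gradients rather than hypervolume probes.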