DVAO: Dynamic Variance-Adaptive Advantage Optimization for Multi-Reward Reinforcement Learning

Abstract

Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
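
The variance-adaptive weighting described above can be illustrated with a minimal sketch. The abstract does not give the exact weighting rule, so the code below makes a loud assumption: each objective's weight is its share of the total in-group reward variance, and the combined advantage is the weighted sum of per-objective group-normalized advantages. The function names (`group_advantages`, `variance_weights`, `combined_advantages`) and the constant `EPS` are hypothetical and are not taken from the paper.

```python
from statistics import fmean, pstdev, pvariance

EPS = 1e-8  # guards against division by zero when all rewards in a group are equal


def group_advantages(rewards):
    """GRPO-style advantage: z-score each reward within its rollout group."""
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


def variance_weights(rewards_per_objective):
    """ASSUMED rule (not from the paper): weight = share of total in-group variance."""
    variances = [pvariance(r) for r in rewards_per_objective]
    total = sum(variances)
    if total < EPS:
        # No objective separates the rollouts, so fall back to uniform weights.
        return [1.0 / len(variances)] * len(variances)
    return [v / total for v in variances]


def combined_advantages(rewards_per_objective):
    """Weighted sum of per-objective advantages, one value per rollout."""
    weights = variance_weights(rewards_per_objective)
    per_objective = [group_advantages(r) for r in rewards_per_objective]
    n_rollouts = len(rewards_per_objective[0])
    return [
        sum(w * adv[i] for w, adv in zip(weights, per_objective))
        for i in range(n_rollouts)
    ]


# Hypothetical rollout group of four responses scored on two objectives.
rewards = [
    [1.0, 0.0, 1.0, 0.0],  # objective 1: clearly separates the rollouts
    [0.5, 0.5, 0.5, 0.6],  # objective 2: almost constant within the group
]
print(variance_weights(rewards))
print(combined_advantages(rewards))
```

Under this assumed rule the weights are non-negative and sum to one, so the combined advantage is a convex combination of z-scored vectors and its root-mean-square magnitude cannot exceed one. This is only meant to convey the flavor of a bounded-magnitude property; it is not the paper's proof, and the actual DVAO weighting may differ.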