DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-Reward Reinforcement Learning

Abstract

Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
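
The two scalarization baselines and the variance-adaptive idea described above can be illustrated with a minimal sketch. The abstract does not state DVAO's weighting formula, so the `variance_weighted_combination` function below is a hypothetical stand-in, not the authors' method: it assumes weights proportional to each objective's within-group reward variance and normalized to sum to one. All function names, the `eps` constant, and the toy rewards are assumptions made for illustration only.

```python
import numpy as np


def group_normalize(x, eps=1e-8):
    """GRPO-style normalization of a 1-D reward vector within one rollout group."""
    return (x - x.mean()) / (x.std() + eps)


def reward_combination(rewards, weights):
    """Sum the weighted rewards first, then normalize once. rewards has shape (G, K)."""
    return group_normalize(rewards @ weights)


def advantage_combination(rewards, weights):
    """Normalize each objective separately, then mix with static weights."""
    per_objective = np.stack(
        [group_normalize(rewards[:, k]) for k in range(rewards.shape[1])], axis=1
    )
    return per_objective @ weights


def variance_weighted_combination(rewards, eps=1e-8):
    """Hypothetical variance-adaptive mix (assumed form, not taken from the paper).

    Weights are proportional to the empirical variance of each objective
    inside the group, so an objective with no spread gets zero weight.
    """
    variances = rewards.var(axis=0)
    weights = variances / (variances.sum() + eps)
    per_objective = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
    return per_objective @ weights, weights


# Toy group of G=4 rollouts and K=2 objectives; the second objective is constant.
toy_rewards = np.array([[1.0, 0.5], [0.0, 0.5], [1.0, 0.5], [0.0, 0.5]])
toy_advantages, toy_weights = variance_weighted_combination(toy_rewards)
print("weights:", toy_weights)
print("advantages:", toy_advantages)
```

In this toy group the constant objective carries no within-group signal, so under the assumed rule its weight is zero and the combined advantage follows the first objective alone. Because the assumed weights are non-negative and sum to at most one, the result is a convex mix of per-objective normalized advantages; whether this matches the boundedness argument in the paper is not something the abstract confirms.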