
Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards

February 10, 2026
作者: Kirill Pavlenko, Alexander Golubev, Simon Karasik, Boris Yangel
cs.AI

Abstract

Group Relative Policy Optimization (GRPO) assigns a single scalar advantage to all tokens in a completion. For structured generations with explicit segments and objectives, this couples unrelated reward signals across segments, leading to objective interference and misattributed credit. We propose Blockwise Advantage Estimation, a family of GRPO-compatible methods that assigns each objective its own advantage and applies it only to the tokens in the corresponding text block, reducing reliance on hand-designed scalar rewards and scaling naturally to additional objectives. A key challenge is estimating advantages for later blocks whose rewards are conditioned on sampled prefixes; standard unbiased approaches require expensive nested rollouts from intermediate states. Concretely, we introduce an Outcome-Conditioned Baseline that approximates intermediate state values using only within-group statistics by stratifying samples according to a prefix-derived intermediate outcome. On math tasks with uncertainty estimation, our method mitigates reward interference, is competitive with a state-of-the-art reward-designed approach, and preserves test-time gains from confidence-weighted ensembling. More broadly, it provides a modular recipe for optimizing sequential objectives in structured generations without additional rollouts.
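The abstract describes the mechanism only at a high level; the snippet below is a minimal sketch of that description, assuming a two-block completion (an answer block followed by a confidence block) and answer correctness as the prefix-derived intermediate outcome. The function names, reward shapes, and normalization details are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def grpo_advantages(rewards):
    """Standard GRPO: one scalar advantage per completion, shared by all its tokens."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def blockwise_advantages(block_rewards, intermediate_outcomes):
    """
    Sketch of Blockwise Advantage Estimation with an Outcome-Conditioned Baseline.

    block_rewards: shape (G, 2), per-objective rewards for a group of G samples,
        column 0 for the first block (e.g. the answer), column 1 for the later
        block (e.g. the confidence statement) whose reward depends on the prefix.
    intermediate_outcomes: length-G discrete outcomes derived from each prefix
        (e.g. whether the answer was correct), used to stratify the group.
    Returns shape (G, 2): one advantage per block, to be applied only to the
    tokens of the corresponding block.
    """
    r = np.asarray(block_rewards, dtype=float)
    outcomes = np.asarray(intermediate_outcomes)
    adv = np.zeros_like(r)

    # First block: the usual group-relative baseline over the whole group.
    adv[:, 0] = (r[:, 0] - r[:, 0].mean()) / (r[:, 0].std() + 1e-8)

    # Later block: baseline within each stratum of the intermediate outcome,
    # approximating the intermediate state value without nested rollouts.
    for o in np.unique(outcomes):
        mask = outcomes == o
        stratum = r[mask, 1]
        adv[mask, 1] = (stratum - stratum.mean()) / (stratum.std() + 1e-8)
    return adv

# Example: 6 samples with [answer reward, confidence-calibration reward],
# stratified by answer correctness derived from the sampled prefix.
block_rewards = [[1.0, 0.8], [1.0, 0.4], [0.0, 0.9], [0.0, 0.2], [1.0, 0.6], [0.0, 0.5]]
correct = [1, 1, 0, 0, 1, 0]
print(blockwise_advantages(block_rewards, correct))
```

Each row of the returned array supplies the advantage applied only to the tokens of the corresponding block, so in this sketch the confidence objective no longer perturbs credit assigned to the answer tokens, and vice versa.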