

You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

April 13, 2026
作者: Yinuo Yang, Zixian Ma, Manasi Ganti, Jieyu Zhang, Ranjay Krishna
cs.AI

Abstract

We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring multiple forward passes, one per candidate response. Our approach concatenates multiple responses with separator tokens and applies a cross-entropy loss over their scalar scores, enabling direct comparative reasoning and efficient N-way preference learning. The multi-response design also yields up to an N× wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable N-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR^2Bench-Image contains human-annotated rankings over responses from 8 diverse models; (2) MR^2Bench-Video is a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments on video question answering spanning 19 models, denoised via preference-graph ensembling. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks, including MR^2Bench-Image, MR^2Bench-Video, and four existing benchmarks, outperforming larger generative and discriminative reward models. We further demonstrate that when our reward model is used for reinforcement learning with GRPO, it produces improved policy models that maintain performance on standard multimodal benchmarks while substantially improving open-ended generation quality, surpassing a single-response discriminative reward model (RM) baseline by a large margin in both training stability and generation quality.
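The core objective described above — treating the N scalar scores produced in one forward pass as logits and applying cross-entropy against the index of the preferred response — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the toy scores, and the use of NumPy in place of the actual vision-language model and MLP value head are all our assumptions.

```python
import numpy as np

def n_way_preference_loss(scores, best_idx):
    """N-way preference cross-entropy over scalar response scores.

    `scores` stands in for the value-head outputs at the N separator
    positions of one concatenated forward pass (hypothetical setup);
    `best_idx` is the index of the human-preferred response.
    """
    scores = np.asarray(scores, dtype=float)
    z = scores - scores.max()                  # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())    # log-softmax over N responses
    return -log_probs[best_idx]                # cross-entropy with label best_idx

# Toy usage: four candidate responses scored jointly; response 2 is preferred.
loss = n_way_preference_loss([0.1, -0.4, 2.3, 0.5], best_idx=2)
```

Because all N scores come from one pass, the loss directly contrasts every candidate against every other, rather than scoring each in isolation as a pairwise Bradley-Terry objective would.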
April 16, 2026