

You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

April 13, 2026
作者: Yinuo Yang, Zixian Ma, Manasi Ganti, Jieyu Zhang, Ranjay Krishna
cs.AI

Abstract

We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring one forward pass per candidate response. Our approach concatenates multiple responses with separator tokens and applies a cross-entropy loss over their scalar scores, enabling direct comparative reasoning and efficient N-way preference learning. This multi-response design also yields up to an N× wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable N-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR^2Bench-Image contains human-annotated rankings over responses from 8 diverse models; (2) MR^2Bench-Video is a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments on video question answering spanning 19 models, denoised via preference-graph ensembling. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks (MR^2Bench-Image, MR^2Bench-Video, and four existing benchmarks), outperforming larger generative and discriminative reward models. We further show that, when used for reinforcement learning with GRPO, our reward model produces improved policy models that maintain performance on standard multimodal benchmarks while substantially improving open-ended generation quality, outperforming a single-response discriminative reward model (RM) baseline by a large margin in both training stability and generation quality.
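The two core ideas in the abstract — concatenating all N candidates behind separator tokens and training the resulting N scalar scores with a softmax cross-entropy against the preferred index — can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the separator token, function names, and the plain-Python softmax are assumptions for clarity.

```python
import math

# Hypothetical separator token; the paper only states that responses are
# joined with separator tokens, not which token is used.
SEP = "<|sep|>"

def build_multi_response_input(prompt, responses):
    """Join the prompt and all N candidate responses with separator
    tokens, so a single forward pass can score every candidate."""
    return prompt + SEP + SEP.join(responses)

def multi_response_loss(scores, preferred_idx):
    """N-way preference loss: softmax cross-entropy over the N scalar
    scores (one per response, e.g. from a value head read out at each
    separator position), with the human-preferred response as target."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return -math.log(exps[preferred_idx] / sum(exps))
```

With uniform scores the loss equals log N; minimizing it pushes the preferred response's score above its N-1 competitors using one forward pass, instead of the N independent passes a single-response reward model would need.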