Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

December 18, 2025
Authors: Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer, Marjan Ghazvininejad
cs.AI

Abstract

Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image generation, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task, drawn from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical yet challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge approaches and models trained on human preferences. The latest Gemini 3 Pro attains 75-80% accuracy; GPT-5 and Gemini 2.5 Pro reach 66-75%, well below the >90% achieved by humans, yet still surpass the widely used GPT-4o (59%). The best-performing open-source model, Qwen3-VL-32B, achieves accuracy comparable to Gemini 2.5 Flash (64%). We also show that performance on MMRB2 correlates strongly with downstream task success under Best-of-N sampling, and we conduct an in-depth analysis that identifies key areas for improving reward models going forward.
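
The abstract relies on two standard evaluation protocols: judge accuracy on expert-annotated preference pairs, and Best-of-N sampling, in which a reward model reranks N candidate responses. Below is a minimal sketch of both, assuming hypothetical `judge` and `reward` callables; the paper's actual interfaces and prompting setup are not specified here.

```python
# Sketch of the two protocols referenced in the abstract (hypothetical interfaces,
# not the paper's implementation).
from typing import Callable, Sequence


def pairwise_accuracy(
    pairs: Sequence[tuple[str, str, str]],   # (prompt, preferred, rejected) triples
    judge: Callable[[str, str, str], int],   # returns 0 if the first response wins, else 1
) -> float:
    """Fraction of expert-annotated pairs where the judge picks the preferred response."""
    correct = sum(
        1 for prompt, preferred, rejected in pairs
        if judge(prompt, preferred, rejected) == 0
    )
    return correct / len(pairs)


def best_of_n(
    prompt: str,
    candidates: Sequence[str],               # N responses sampled from a generator
    reward: Callable[[str, str], float],     # scalar reward model score for (prompt, response)
) -> str:
    """Best-of-N sampling: return the candidate the reward model scores highest."""
    return max(candidates, key=lambda c: reward(prompt, c))
```

Under this framing, a stronger judge raises `pairwise_accuracy` on MMRB2, and the paper's correlation result says that the same judge, used as `reward` in `best_of_n`, tends to yield higher downstream task success.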