
Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

December 18, 2025
Authors: Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer, Marjan Ghazvininejad
cs.AI

Abstract

Reward models (RMs) are essential for training large language models (LLMs), but they remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image generation, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task drawn from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge setups and models trained on human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best-performing open-source model, Qwen3-VL-32B, achieves accuracy comparable to Gemini 2.5 Flash (64%). We also show that MMRB2 performance correlates strongly with downstream task success under Best-of-N sampling, and we conduct an in-depth analysis that identifies key areas for improving reward models going forward.
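As a rough illustration of the evaluation protocol described in the abstract, the sketch below shows how a judge could be scored against MMRB2-style preference pairs (accuracy = agreement with the expert-consensus label) and how a reward model could drive Best-of-N sampling. This is a minimal sketch under assumed interfaces: the `PreferencePair` layout and the `judge_prefers_a` / `reward` callables are hypothetical placeholders, not the authors' actual data format or API.

```python
# Hypothetical sketch of MMRB2-style judge evaluation and Best-of-N selection.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str            # task prompt (may reference interleaved images)
    response_a: str        # candidate response A (text and/or image references)
    response_b: str        # candidate response B
    human_prefers_a: bool  # expert-consensus preference label

def judge_accuracy(pairs: List[PreferencePair],
                   judge_prefers_a: Callable[[str, str, str], bool]) -> float:
    """Fraction of pairs on which the judge agrees with the human preference."""
    correct = sum(
        judge_prefers_a(p.prompt, p.response_a, p.response_b) == p.human_prefers_a
        for p in pairs
    )
    return correct / len(pairs)

def best_of_n(prompt: str,
              candidates: List[str],
              reward: Callable[[str, str], float]) -> str:
    """Best-of-N sampling: return the candidate the reward model scores highest."""
    return max(candidates, key=lambda c: reward(prompt, c))
```

A judge that scores higher under `judge_accuracy` would, per the paper's correlation finding, also tend to pick better responses when used as the `reward` function in `best_of_n`.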