RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs
May 22, 2025
Authors: Meng-Hao Guo, Xuanyu Chu, Qianrui Yang, Zhe-Han Mo, Yiqing Shen, Pei-lin Li, Xinjie Lin, Jinnian Zhang, Xin-Sheng Chen, Yi Zhang, Kiyohiro Nakayama, Zhengyang Geng, Houwen Peng, Han Hu, Shi-Min Hu
cs.AI
Abstract
The rapid advancement of native multi-modal models and omni-models,
exemplified by GPT-4o, Gemini, and o3, with their capability to process and
generate content across modalities such as text and images, marks a significant
milestone in the evolution of intelligence. Systematic evaluation of their
multi-modal output capabilities in visual thinking processes (also known as
multi-modal chain of thought, M-CoT) becomes critically important. However,
existing benchmarks for evaluating multi-modal models primarily focus on
assessing multi-modal inputs and text-only reasoning while neglecting the
importance of reasoning through multi-modal outputs. In this paper, we present
a benchmark, dubbed RBench-V, designed to assess models' vision-indispensable
reasoning abilities. To construct RBench-V, we carefully hand-pick 803
questions covering math, physics, counting, and games. Unlike previous
benchmarks that typically specify certain input modalities, RBench-V presents
problems centered on multi-modal outputs, which require image manipulation such
as generating novel images and constructing auxiliary lines to support the
reasoning process. We evaluate numerous open- and closed-source models on
RBench-V, including o3, Gemini 2.5 Pro, and Qwen2.5-VL. Even the
best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below
the human score of 82.3%, highlighting that current models struggle to leverage
multi-modal reasoning. Data and code are available at
https://evalmodels.github.io/rbenchv.
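
As a rough illustration of how a model's per-category accuracy on an RBench-V-style benchmark might be computed, here is a minimal Python sketch. The JSON field names ("question", "image", "answer", "category") and the model_answer stub are assumptions made for this example, not the authors' released format; the official data and evaluation code are at the link above.

import json
from collections import defaultdict

def model_answer(question, image_path):
    """Placeholder: call a multi-modal model here and return its final answer string."""
    raise NotImplementedError

def evaluate(path):
    # Assumed schema: a JSON list of records with "question", "image",
    # "answer", and "category" fields (hypothetical, for illustration).
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        cat = item["category"]  # e.g. math, physics, counting, games
        pred = model_answer(item["question"], item["image"])
        total[cat] += 1
        # Simple exact-match scoring after normalization; the paper's
        # actual grading protocol may differ.
        if pred.strip().lower() == item["answer"].strip().lower():
            correct[cat] += 1
    scores = {cat: correct[cat] / total[cat] for cat in total}
    scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores

Running evaluate on the benchmark file and comparing scores["overall"] against the reported figures (25.8% for o3, 82.3% for humans) is the kind of comparison the paper describes.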