RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs
May 22, 2025
Authors: Meng-Hao Guo, Xuanyu Chu, Qianrui Yang, Zhe-Han Mo, Yiqing Shen, Pei-Lin Li, Xinjie Lin, Jinnian Zhang, Xin-Sheng Chen, Yi Zhang, Kiyohiro Nakayama, Zhengyang Geng, Houwen Peng, Han Hu, Shi-Min Hu
cs.AI
Abstract
The rapid advancement of native multi-modal models and omni-models,
exemplified by GPT-4o, Gemini, and o3, with their capability to process and
generate content across modalities such as text and images, marks a significant
milestone in the evolution of intelligence. Systematic evaluation of their
multi-modal output capabilities in visual thinking processes (also known as
multi-modal chain of thought, M-CoT) becomes critically important. However,
existing benchmarks for evaluating multi-modal models primarily focus on
assessing multi-modal inputs and text-only reasoning while neglecting the
importance of reasoning through multi-modal outputs. In this paper, we present
a benchmark, dubbed RBench-V, designed to assess models' vision-indispensable
reasoning abilities. To construct RBench-V, we carefully hand-pick 803
questions covering math, physics, counting, and games. Unlike previous
benchmarks that typically specify certain input modalities, RBench-V presents
problems centered on multi-modal outputs, which require image manipulation such
as generating novel images and constructing auxiliary lines to support the
reasoning process. We evaluate numerous open- and closed-source models on
RBench-V, including o3, Gemini 2.5 Pro, Qwen2.5-VL, etc. Even the
best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below
the human score of 82.3%, highlighting that current models struggle to leverage
multi-modal reasoning. Data and code are available at
https://evalmodels.github.io/rbenchv.
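To make the reported numbers concrete, the sketch below shows how accuracy over a benchmark of this shape might be computed, overall and per category (math, physics, counting, games). This is a minimal illustration, not the released RBench-V evaluation code: the JSON schema (`question`, `category`, `answer` fields), the file name `rbenchv_items.json`, the `load_items` helper, and the `model_answer` stub are all hypothetical assumptions.

```python
import json
from collections import defaultdict

def load_items(path):
    """Load benchmark items from a JSON file.
    Assumed (hypothetical) schema: a list of objects with
    'question', 'category', and a string 'answer'."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def evaluate(items, model_answer):
    """Compute overall and per-category accuracy.
    `model_answer` is any callable mapping an item dict to an answer string."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        cat = item["category"]
        total[cat] += 1
        # Exact-match scoring after whitespace normalization; the paper's
        # actual grading protocol may differ (e.g., judge-based scoring).
        if model_answer(item).strip() == item["answer"].strip():
            correct[cat] += 1
    overall = sum(correct.values()) / sum(total.values())
    per_category = {c: correct[c] / total[c] for c in total}
    return overall, per_category

if __name__ == "__main__":
    items = load_items("rbenchv_items.json")  # hypothetical file name
    # Placeholder model that always answers "A"; replace with a real model call.
    overall, per_category = evaluate(items, lambda item: "A")
    print(f"overall accuracy: {overall:.1%}")
    for cat, acc in sorted(per_category.items()):
        print(f"{cat}: {acc:.1%}")
```

Under this scheme, the headline figures in the abstract correspond to the `overall` value: roughly 0.258 for o3 and 0.823 for the human baseline over the 803 questions.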