When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
November 4, 2025
Authors: Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye
cs.AI
Abstract
We propose MIRA, a new benchmark designed to evaluate models in scenarios
where generating intermediate visual images is essential for successful
reasoning. Unlike traditional chain-of-thought (CoT) methods that rely solely
on text, tasks in MIRA require models to generate and utilize intermediate
images, such as sketches, structural diagrams, or path drawings, to guide
their reasoning process. This setup closely mirrors how humans solve complex
problems by "drawing to think". To this end, MIRA focuses on tasks that are
intrinsically
challenging and involve complex structures, spatial relationships, or reasoning
steps that are difficult to express through language alone. To ensure that our
evaluation data is of high quality, we include 546 multimodal problems,
annotated with intermediate visual images and final answers. We also propose a
unified evaluation protocol for MIRA that spans three levels of evaluation
input: direct input with image and question only, text-only CoT input with
image and thinking prompts, and Visual-CoT input with both annotated image
clues and textual thinking prompts. To probe the upper bound of model capacity
on our benchmark, we also report pass@k and majority voting accuracies under
different k settings. Experimental results show that existing multimodal large
language models, including the strongest proprietary models as well as strong
open-weight models, perform poorly when relying solely on textual prompts.
However, when intermediate visual cues are provided, model performance improves
consistently, yielding an average relative gain of 33.7% across all models and
tasks. We further probe this upper bound by expanding the search space and by
designing textual prompts aligned with Visual-CoT, but both strategies yield
only limited
improvements compared to our Visual-CoT setting. These results underscore the
critical role of imagined visual information in enabling successful reasoning
on MIRA.
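
To make the three evaluation input levels concrete, here is a minimal sketch of how such inputs could be assembled. The function name `build_input`, the message schema, and the exact wording of `THINK_PROMPT` are illustrative assumptions, not MIRA's actual evaluation harness.

```python
from typing import Dict, List, Optional

# Illustrative thinking-prompt wording; the benchmark's exact prompt may differ.
THINK_PROMPT = "Think step by step, then give the final answer."


def build_input(question: str,
                image_path: str,
                level: str,
                visual_clue_path: Optional[str] = None) -> List[Dict]:
    """Assemble one evaluation input under the three settings:
    'direct'     -> image + question only
    'text_cot'   -> image + question + textual thinking prompt
    'visual_cot' -> image + annotated visual clue + question + thinking prompt
    """
    content: List[Dict] = [{"type": "image", "path": image_path}]
    if level == "visual_cot":
        if visual_clue_path is None:
            raise ValueError("visual_cot requires an annotated visual clue image")
        content.append({"type": "image", "path": visual_clue_path})
    text = question if level == "direct" else f"{question}\n{THINK_PROMPT}"
    content.append({"type": "text", "text": text})
    return [{"role": "user", "content": content}]
```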
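
For the upper-bound analysis, pass@k and majority-voting accuracy can be computed per problem from n sampled model answers. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where c of the n samples are correct, together with a simple plurality vote; the sample answers shown are hypothetical.

```python
from collections import Counter
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    (without replacement) from n total samples is correct, given that
    c of the n samples are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote_correct(answers: Sequence[str], gold: str) -> bool:
    """True if the most frequent sampled answer equals the gold answer."""
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == gold


# Hypothetical per-problem samples: 8 answers drawn from one model.
samples = ["B", "B", "C", "A", "A", "D", "C", "B"]
gold = "B"
n, c = len(samples), sum(a == gold for a in samples)
print(f"pass@4 = {pass_at_k(n, c, 4):.3f}")                      # -> 0.929
print(f"majority vote correct: {majority_vote_correct(samples, gold)}")
```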