

When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

November 4, 2025
作者: Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye
cs.AI

Abstract

We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images, such as sketches, structural diagrams, or path drawings, to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To this end, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including the strongest proprietary models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.
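The abstract reports pass@k and majority-voting accuracies under a three-level input protocol but does not include the evaluation code. Below is a minimal, hypothetical Python sketch of how such metrics are conventionally computed (the unbiased pass@k estimator and a simple majority vote); the EVAL_SETTINGS labels and the toy samples are assumptions for illustration, not MIRA's actual prompt templates or data.

```python
import math
from collections import Counter

# The three evaluation input levels described for MIRA
# (labels are illustrative; the exact prompt templates are not given in the abstract).
EVAL_SETTINGS = {
    "direct": "image + question only",
    "text_cot": "image + question + textual thinking prompt",
    "visual_cot": "image + annotated visual clue + textual thinking prompt",
}

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples is correct, given c correct answers among n attempts."""
    if n - c < k:  # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def majority_vote_correct(answers: list[str], gold: str) -> bool:
    """Majority voting: compare the most frequent sampled answer to the gold answer."""
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == gold

# Toy example: 8 sampled answers for a single benchmark item
samples = ["B", "C", "B", "B", "A", "B", "C", "B"]
gold = "B"
n, c = len(samples), sum(a == gold for a in samples)
print(f"pass@4 ~ {pass_at_k(n, c, k=4):.3f}")
print("majority vote correct:", majority_vote_correct(samples, gold))
```

Pass@k rewards finding at least one correct answer anywhere in the sample budget, while majority voting rewards consistency, which is why the paper reports both when probing the upper bound of model capacity.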