When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
November 4, 2025
Authors: Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye
cs.AI
Abstract
We propose MIRA, a new benchmark designed to evaluate models in scenarios
where generating intermediate visual images is essential for successful
reasoning. Unlike traditional CoT methods that rely solely on text, tasks in
MIRA require models to generate and utilize intermediate images, such as
sketches, structural diagrams, or path drawings, to guide their reasoning
process. This setup closely mirrors how humans solve complex problems through
"drawing to think". To solve this, MIRA focuses on tasks that are intrinsically
challenging and involve complex structures, spatial relationships, or reasoning
steps that are difficult to express through language alone. To ensure that our
evaluation data is of high quality, we include 546 multimodal problems,
annotated with intermediate visual images and final answers. We also propose a
unified evaluation protocol for MIRA that spans three levels of evaluation
input: direct input with image and question only, text-only CoT input with
image and thinking prompts, and Visual-CoT input with both annotated image
clues and textual thinking prompts. To probe the upper bound of model capacity
on our benchmark, we also report pass@k and majority voting accuracies under
different k settings. Experimental results show that existing multimodal large
language models, including the strongest proprietary models as well as strong
open-weight models, perform poorly when relying solely on textual prompts.
However, when intermediate visual cues are provided, model performance improves
consistently, yielding an average relative gain of 33.7% across all models and
tasks. We also probe the upper bound by expanding the search space and
designing textual prompts aligned with Visual-CoT, but both approaches yield only limited
improvements compared to our Visual-CoT setting. These results underscore the
critical role of imagined visual information in enabling successful reasoning
on MIRA.
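To make the three evaluation levels concrete, the sketch below assembles the model inputs for each setting. This is a minimal illustration only, not MIRA's released code: the field names (question, input_image, annotated_image) and the exact prompt wording are assumptions for illustration, not the benchmark's actual schema or prompts.

```python
# Illustrative sketch of MIRA's three evaluation input levels.
# Field names and prompt wording are assumptions, not the benchmark's actual schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class MiraExample:
    question: str
    input_image: str                 # path to the original problem image
    annotated_image: Optional[str]   # path to the annotated intermediate visual clue
    answer: str


def build_inputs(ex: MiraExample, level: str) -> dict:
    """Assemble model inputs for one of the three evaluation levels."""
    if level == "direct":
        # Level 1: image and question only, no reasoning instruction.
        return {"images": [ex.input_image], "text": ex.question}
    if level == "text_cot":
        # Level 2: image and question plus a textual thinking prompt.
        prompt = ex.question + "\nThink step by step before giving the final answer."
        return {"images": [ex.input_image], "text": prompt}
    if level == "visual_cot":
        # Level 3: original image, annotated visual clue, and textual thinking prompt.
        prompt = (ex.question
                  + "\nUse the provided intermediate sketch as a reasoning aid, "
                    "think step by step, then give the final answer.")
        return {"images": [ex.input_image, ex.annotated_image], "text": prompt}
    raise ValueError(f"unknown evaluation level: {level}")
```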
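The abstract also reports pass@k and majority-voting accuracies under different k settings. The exact estimators used are not specified here; the sketch below assumes the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), and a simple majority vote over n sampled answers.

```python
# Illustrative sketch of the two reported metrics; MIRA's exact settings may differ.
from collections import Counter
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote_correct(predictions: List[str], gold: str) -> bool:
    """True if the most frequent sampled prediction matches the gold answer."""
    most_common, _ = Counter(predictions).most_common(1)[0]
    return most_common == gold


# Example: 8 samples, 2 correct -> pass@4 = 1 - C(6,4)/C(8,4)
print(round(pass_at_k(n=8, c=2, k=4), 3))           # 0.786
print(majority_vote_correct(["A", "B", "A"], "A"))   # True
```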