Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do
June 21, 2026
Authors: Zhuoran Jin, Kejian Zhu, Hongbang Yuan, Yupu Hao, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
cs.AI
Abstract
Chain-of-Thought (CoT) has become a standard method for improving the reasoning capabilities of large language models (LLMs) by eliciting step-by-step thinking, but its effectiveness on multimodal tasks remains unclear. In this paper, we systematically investigate a key question: what can multimodal Chain-of-Thought reasoning do, and where and why does it fall short? To this end, we evaluate 14 non-reasoning models and 8 reasoning models on 12 multimodal tasks spanning two categories, perception and reasoning. Our analysis reveals several important findings: (1) CoT is not a free lunch and should be used selectively, depending on the specific requirements of each task. On perception tasks, CoT can have undesirable side effects, such as reduced performance in visual grounding and object counting. In contrast, it proves effective on reasoning tasks involving mathematical, scientific, and multi-image reasoning. (2) Compared with the original models, existing open-source multimodal reasoning models often yield only marginal overall improvements, possibly because of an overemphasis on mathematical reasoning at the expense of broader capabilities. (3) Visual reasoning remains a key bottleneck for current multimodal CoT: models exhibit a "Look Light, Think Heavy" pattern in which verbal reflection rises and falls during reasoning, whereas visual reflection consistently diminishes. These findings suggest that while multimodal CoT handles verbal reflection relatively well, it lacks the ability to maintain deep visual introspection throughout the reasoning process.
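Finding (1) implies a per-task decision about whether to prompt with CoT. One way to make that decision concrete is to compare accuracy with and without CoT on a held-out set for each task. The sketch below is only an illustration under stated assumptions, not the authors' evaluation protocol: the function names, the `margin` parameter, and the toy task names and correctness flags are all hypothetical and are not results from the paper.

```python
from statistics import mean


def accuracy(flags):
    """Fraction of correct answers in a list of per-example booleans."""
    return mean(1.0 if f else 0.0 for f in flags)


def cot_gain(direct_flags, cot_flags):
    """Accuracy with CoT minus accuracy with direct answering."""
    return accuracy(cot_flags) - accuracy(direct_flags)


def choose_prompting(results, margin=0.0):
    """Pick a prompting mode per task.

    results maps a task name to (direct_flags, cot_flags).
    CoT is chosen only when its gain exceeds `margin` (a hypothetical knob).
    """
    return {
        task: "cot" if cot_gain(direct, cot) > margin else "direct"
        for task, (direct, cot) in results.items()
    }


# Toy, made-up correctness flags purely to exercise the code.
toy_results = {
    "toy_task_a": ([True, True, False, True], [True, False, False, True]),
    "toy_task_b": ([False, True, False, False], [True, True, False, True]),
}

choice = choose_prompting(toy_results)
print(choice)
```

In practice the gain would be estimated on a much larger validation set, and a positive `margin` could be used to account for the extra tokens that CoT generates.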