Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
April 17, 2026
Authors: Sai Srinivas Kancheti, Aditya Sanjiv Kanade, Vineeth N. Balasubramanian, Tanuja Ganu
cs.AI
Abstract
Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT-prompted MLMs suffer from severe shortcut learning and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
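To make the no-image ablation concrete, the sketch below scores a model on a spatial benchmark with and without the image; a small gap between the two scores would indicate reliance on textual priors rather than visual input. This is a minimal illustration of the general idea, not the authors' No-Image++ protocol: `SpatialExample`, the `model_answer` callable, and the toy data are hypothetical placeholders.

```python
# Minimal sketch of a no-image ablation for detecting shortcut learning.
# All names here (SpatialExample, model_answer, the toy data) are
# illustrative assumptions, not the paper's code or any real model API.
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class SpatialExample:
    image_path: Optional[str]  # path to the image, or None when ablated
    question: str              # spatial question about the scene
    answer: str                # ground-truth answer


def accuracy(model_answer: Callable[[Optional[str], str], str],
             examples: Sequence[SpatialExample],
             drop_image: bool) -> float:
    """Score a model with or without access to the image."""
    correct = 0
    for ex in examples:
        image = None if drop_image else ex.image_path
        pred = model_answer(image, ex.question)
        correct += int(pred.strip().lower() == ex.answer.strip().lower())
    return correct / max(len(examples), 1)


if __name__ == "__main__":
    # Toy stand-in for a multimodal model: answers "left" regardless of input.
    dummy = lambda image, question: "left"
    data = [
        SpatialExample("img_0.png", "Is the cup left or right of the plate?", "left"),
        SpatialExample("img_1.png", "Is the cup left or right of the plate?", "right"),
    ]
    with_image = accuracy(dummy, data, drop_image=False)
    without_image = accuracy(dummy, data, drop_image=True)
    # If the two scores are close, the model's answers do not depend on the
    # image, which is the shortcut behavior the ablation is meant to expose.
    print(f"with image: {with_image:.2f}, no image: {without_image:.2f}")
```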