Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

April 17, 2026
作者: Sai Srinivas Kancheti, Aditya Sanjiv Kanade, Vineeth N. Balasubramanian, Tanuja Ganu
cs.AI

Abstract

Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. In a comprehensive evaluation of seventeen models across thirteen spatial benchmarks, we identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT-prompted MLMs suffer from severe shortcut learning, hallucinating visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.