Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

Abstract

Multimodal Reasoning Models (MRMs) that leverage Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT-prompted Multimodal Language Models (MLMs) suffer from severe shortcut learning and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
