Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

Abstract

Chain-of-Thought (CoT) has become a standard method for improving the reasoning capabilities of large language models (LLMs) by eliciting step-by-step thinking, but its effectiveness on multimodal tasks remains unclear. In this paper, we systematically investigate a key question: what can multimodal Chain-of-Thought reasoning do, and where and why does it fall short? To this end, we evaluate 14 non-reasoning models and 8 reasoning models on 12 multimodal tasks spanning perception and reasoning categories. Our analysis reveals three main findings. (1) CoT is not a free lunch and should be used selectively, depending on the specific requirements of each task. For perception tasks, CoT can cause undesirable side effects, such as reduced performance in visual grounding and object counting. In contrast, it proves effective for reasoning tasks involving mathematical, scientific, and multi-image reasoning. (2) Compared to the original models, existing open-source multimodal reasoning models often yield only marginal overall improvements, possibly due to an overemphasis on mathematical reasoning at the expense of broader capabilities. (3) Visual reasoning remains a key bottleneck for current multimodal CoT: models exhibit a "Look Light, Think Heavy" pattern in which verbal reflection rises and falls during reasoning, whereas visual reflection consistently diminishes. These findings suggest that while multimodal CoT handles verbal reflection relatively well, it lacks the ability to maintain deep visual introspection throughout the reasoning process.