Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Abstract

Recent progress in multimodal reasoning has been driven largely by textual Chain-of-Thought (CoT), a paradigm in which models reason within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, using vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to models that can truly think with images. This emerging paradigm is characterized by models that use visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the thinking-with-images paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research toward more powerful and human-aligned multimodal AI.