Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Abstract

Multimodal reasoning has advanced significantly with textual Chain-of-Thought (CoT), a paradigm in which models reason within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, using vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a paradigm shift from models that merely think about images to models that can truly think with images. In this emerging paradigm, models use visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution along a trajectory of increasing cognitive autonomy that unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the "thinking with images" paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research toward more powerful and human-aligned multimodal AI.