Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Abstract

Recent progress in multimodal reasoning has been driven largely by textual Chain-of-Thought (CoT), a paradigm in which models reason within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, using vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a paradigm shift from models that merely think about images to models that can think with images. In this emerging paradigm, models use visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution along a trajectory of increasing cognitive autonomy that unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the "thinking with images" paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research toward more powerful and human-aligned multimodal AI.