Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
May 8, 2025
作者: Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang
cs.AI
Abstract
Reasoning lies at the heart of intelligence, shaping the ability to make
decisions, draw conclusions, and generalize across domains. In artificial
intelligence, as systems increasingly operate in open, uncertain, and
multimodal environments, reasoning becomes essential for enabling robust and
adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a
promising paradigm, integrating modalities such as text, images, audio, and
video to support complex reasoning capabilities and aiming to achieve
comprehensive perception, precise understanding, and deep reasoning. As
research advances, multimodal reasoning has rapidly evolved from modular,
perception-driven pipelines to unified, language-centric frameworks that offer
more coherent cross-modal understanding. While instruction tuning and
reinforcement learning have improved model reasoning, significant challenges
remain in omni-modal generalization, reasoning depth, and agentic behavior. To
address these issues, we present a comprehensive and structured survey of
multimodal reasoning research, organized around a four-stage developmental
roadmap that reflects the field's shifting design philosophies and emerging
capabilities. First, we review early efforts based on task-specific modules,
where reasoning was implicitly embedded across stages of representation,
alignment, and fusion. Next, we examine recent approaches that unify reasoning
into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT)
and multimodal reinforcement learning enabling richer and more structured
reasoning chains. Finally, drawing on empirical insights from challenging
benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the
conceptual direction of native large multimodal reasoning models (N-LMRMs),
which aim to support scalable, agentic, and adaptive reasoning and planning in
complex, real-world environments.