Mind with Eyes: from Language Reasoning to Multimodal Reasoning

Abstract

Language models have recently advanced into the realm of reasoning, yet only through multimodal reasoning can we fully unlock their potential and achieve more comprehensive, human-like cognitive capabilities. This survey provides a systematic overview of recent multimodal reasoning approaches, categorizing them into two levels: language-centric multimodal reasoning and collaborative multimodal reasoning. The former encompasses one-pass visual perception and active visual perception, where vision primarily plays a supporting role in language reasoning. The latter involves action generation and state updates within the reasoning process, enabling more dynamic interaction between modalities. Furthermore, we analyze the technical evolution of these methods, discuss their inherent challenges, and introduce key benchmark tasks and evaluation metrics for assessing multimodal reasoning performance. Finally, we offer insights into future research directions from two perspectives: (i) from visual-language reasoning to omnimodal reasoning, and (ii) from multimodal reasoning to multimodal agents. This survey aims to provide a structured overview that will inspire further advances in multimodal reasoning research.
