
Mind with Eyes: from Language Reasoning to Multimodal Reasoning

March 23, 2025
Authors: Zhiyu Lin, Yifei Gao, Xian Zhao, Yunfan Yang, Jitao Sang
cs.AI

Abstract

Language models have recently advanced into the realm of reasoning, yet it is through multimodal reasoning that we can fully unlock the potential for more comprehensive, human-like cognitive capabilities. This survey provides a systematic overview of recent multimodal reasoning approaches, categorizing them into two levels: language-centric multimodal reasoning and collaborative multimodal reasoning. The former encompasses one-pass visual perception and active visual perception, where vision primarily serves a supporting role in language reasoning. The latter involves action generation and state updates within the reasoning process, enabling more dynamic interaction between modalities. Furthermore, we analyze the technical evolution of these methods, discuss their inherent challenges, and introduce key benchmark tasks and evaluation metrics for assessing multimodal reasoning performance. Finally, we offer insights into future research directions from two perspectives: (i) from vision-language reasoning to omnimodal reasoning, and (ii) from multimodal reasoning to multimodal agents. This survey aims to provide a structured overview that will inspire further advances in multimodal reasoning research.
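
To make the two-level taxonomy concrete, the minimal Python sketch below contrasts the two regimes. It is illustrative only and not from the paper: every name here (State, perceive, language_centric_reasoning, collaborative_reasoning) is a hypothetical placeholder standing in for real model and tool calls.

```python
# Illustrative sketch only -- not the authors' code. All names are
# hypothetical placeholders for real model/tool calls.
from dataclasses import dataclass, field

@dataclass
class State:
    """Working state shared between the language reasoner and visual tools."""
    image: str                                   # stands in for raw visual input
    observations: list = field(default_factory=list)

def perceive(image: str) -> str:
    """One-pass visual perception: encode the image once, up front."""
    return f"caption({image})"

def language_centric_reasoning(image: str, question: str) -> str:
    # Level 1: vision plays a supporting role. A single perception step
    # feeds an otherwise purely linguistic chain of thought.
    observation = perceive(image)
    return f"reason over [{observation}] to answer: {question}"

def collaborative_reasoning(image: str, question: str, max_steps: int = 3) -> str:
    # Level 2: the reasoner generates actions (e.g. crop, zoom, re-query a
    # visual tool) and updates its state after each one, so language and
    # vision interact throughout the reasoning process.
    state = State(image=image)
    for step in range(max_steps):
        action = f"inspect_region_{step}"        # action generation
        result = f"{action} -> new visual evidence"
        state.observations.append(result)        # state update
    return f"answer({question}) from {len(state.observations)} observations"

if __name__ == "__main__":
    print(language_centric_reasoning("chart.png", "What is the trend?"))
    print(collaborative_reasoning("chart.png", "What is the trend?"))
```

The structural difference the survey draws is visible here: the language-centric path perceives once and then reasons purely in language, while the collaborative loop interleaves action generation with state updates at every step.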

