Mind with Eyes: from Language Reasoning to Multimodal Reasoning

Abstract

Language models have recently advanced into the realm of reasoning, yet it is through multimodal reasoning that we can fully unlock the potential to achieve more comprehensive, human-like cognitive capabilities. This survey provides a systematic overview of recent multimodal reasoning approaches, categorizing them into two levels: language-centric multimodal reasoning and collaborative multimodal reasoning. The former encompasses one-pass visual perception and active visual perception, where vision primarily serves a supporting role in language reasoning. The latter involves action generation and state updates within the reasoning process, enabling a more dynamic interaction between modalities. Furthermore, we analyze the technical evolution of these methods, discuss their inherent challenges, and introduce key benchmark tasks and evaluation metrics for assessing multimodal reasoning performance. Finally, we provide insights into future research directions from two perspectives: (i) from visual-language reasoning to omnimodal reasoning and (ii) from multimodal reasoning to multimodal agents. This survey aims to provide a structured overview that will inspire further advances in multimodal reasoning research.
