Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

March 16, 2025
Authors: Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, William Wang, Ziwei Liu, Jiebo Luo, Hao Fei
cs.AI

Abstract

By extending the advantages of human-like, step-by-step chain-of-thought (CoT) reasoning to multimodal contexts, multimodal CoT (MCoT) reasoning has recently attracted significant research attention, especially in its integration with multimodal large language models (MLLMs). Existing MCoT studies design a variety of methodologies and innovative reasoning paradigms to address the unique challenges of different modalities, including image, video, speech, audio, 3D, and structured data, and have achieved broad success in applications such as robotics, healthcare, autonomous driving, and multimodal generation. However, MCoT still presents distinct challenges and opportunities that require further attention if the field is to keep thriving, and an up-to-date review of the domain is unfortunately lacking. To bridge this gap, we present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions. We offer a comprehensive taxonomy and an in-depth analysis of current methodologies from diverse perspectives across various application scenarios. Furthermore, we provide insights into existing challenges and future research directions, aiming to foster innovation toward multimodal AGI.
