Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

March 16, 2025
Authors: Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, William Wang, Ziwei Liu, Jiebo Luo, Hao Fei
cs.AI

Abstract

By extending the advantages of human-like, step-by-step chain-of-thought (CoT) reasoning to multimodal contexts, multimodal CoT (MCoT) reasoning has recently garnered significant research attention, especially in its integration with multimodal large language models (MLLMs). Existing MCoT studies design various methodologies and innovative reasoning paradigms to address the unique challenges of different modalities, including image, video, speech, audio, 3D, and structured data, and have achieved extensive success in applications such as robotics, healthcare, autonomous driving, and multimodal generation. However, MCoT still presents distinct challenges and opportunities that require further attention if the field is to keep thriving, and an up-to-date review of the domain is unfortunately lacking. To bridge this gap, we present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions. We offer a comprehensive taxonomy and an in-depth analysis of current methodologies from diverse perspectives across various application scenarios. Furthermore, we provide insights into existing challenges and future research directions, aiming to foster innovation toward multimodal AGI.
