Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
February 12, 2025
Authors: Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soleymani Baghshah, Ehsaneddin Asgari
cs.AI
Abstract
Large Language Models (LLMs) struggle with hallucinations and outdated
knowledge due to their reliance on static training data. Retrieval-Augmented
Generation (RAG) mitigates these issues by integrating external dynamic
information, enhancing factual grounding and timeliness. Recent advances in
multimodal learning have led to the development of Multimodal RAG,
incorporating multiple modalities such as text, images, audio, and video to
enhance the generated outputs. However, cross-modal alignment and reasoning
introduce unique challenges to Multimodal RAG, distinguishing it from
traditional unimodal RAG. This survey offers a structured and comprehensive
analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks,
evaluation, methodologies, and innovations in retrieval, fusion, augmentation,
and generation. We review training strategies, robustness
enhancements, and loss functions in detail, while also exploring diverse Multimodal
RAG scenarios. Furthermore, we discuss open challenges and future research
directions to support advancements in this evolving field. This survey lays the
foundation for developing more capable and reliable AI systems that effectively
leverage multimodal dynamic external knowledge bases. Resources are available
at https://github.com/llm-lab-org/Multimodal-RAG-Survey.