Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

Abstract

Large Language Models (LLMs) struggle with hallucinations and outdated knowledge because they rely on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external, dynamic information, which improves factual grounding and keeps outputs up to date. Recent advances in multimodal learning have led to the development of Multimodal RAG, which incorporates multiple modalities such as text, images, audio, and video to enhance generated outputs. However, cross-modal alignment and reasoning introduce challenges unique to Multimodal RAG, distinguishing it from traditional unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, and loss functions in detail, and explore diverse Multimodal RAG scenarios. Furthermore, we discuss open challenges and future research directions to support advances in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal, dynamic external knowledge bases. Resources are available at https://github.com/llm-lab-org/Multimodal-RAG-Survey.

