

UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities

April 29, 2025
作者: Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang
cs.AI

Abstract

Retrieval-Augmented Generation (RAG) has shown substantial promise in improving factual accuracy by grounding model responses with external knowledge relevant to queries. However, most existing RAG approaches are limited to a text-only corpus, and while recent efforts have extended RAG to other modalities such as images and videos, they typically operate over a single modality-specific corpus. In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, a novel RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. Specifically, motivated by the observation that forcing all modalities into a unified representation space derived from a single combined corpus causes a modality gap, where the retrieval tends to favor items from the same modality as the query, we propose a modality-aware routing mechanism that dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it. Also, beyond modality, we organize each modality into multiple granularity levels, enabling fine-tuned retrieval tailored to the complexity and scope of the query. We validate UniversalRAG on 8 benchmarks spanning multiple modalities, showing its superiority over modality-specific and unified baselines.
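To make the routing-then-retrieval flow described above concrete, here is a minimal sketch in Python. It is an illustration only, not the authors' implementation: the `Corpus` structure, the keyword-based `route_query` placeholder (standing in for the paper's modality-aware router), and the `universal_rag_answer` pipeline are all hypothetical names introduced for this example.

```python
# Minimal sketch of routing a query to a modality- and granularity-specific
# corpus, retrieving from it, and generating a grounded answer.
# All names here are illustrative assumptions, not the paper's code.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Corpus:
    """One modality- and granularity-specific corpus with its own retriever."""
    modality: str      # e.g. "text", "image", "video"
    granularity: str   # e.g. "paragraph", "document", "image", "clip", "full_video"
    retrieve: Callable[[str, int], list[str]]  # (query, top_k) -> retrieved items

def route_query(query: str) -> tuple[str, str]:
    """Toy router: choose target modality and granularity for a query.
    The paper uses a learned modality-aware router; these keyword rules
    are placeholders for illustration only."""
    q = query.lower()
    if "video" in q or "scene" in q:
        return "video", "clip" if "moment" in q else "full_video"
    if "image" in q or "photo" in q or "diagram" in q:
        return "image", "image"
    # Broader questions go to whole documents, short factual ones to paragraphs.
    return "text", "document" if len(q.split()) > 15 else "paragraph"

def universal_rag_answer(query: str, corpora: list[Corpus],
                         generate: Callable[[str, list[str]], str],
                         top_k: int = 3) -> str:
    """Route the query, retrieve from the matching corpus, then generate."""
    modality, granularity = route_query(query)
    corpus = next(c for c in corpora
                  if c.modality == modality and c.granularity == granularity)
    evidence = corpus.retrieve(query, top_k)
    return generate(query, evidence)

if __name__ == "__main__":
    # Dummy corpus and generator, just to show the call pattern end to end.
    corpora = [Corpus("text", "paragraph",
                      lambda q, k: [f"paragraph hit for: {q}"] * k)]
    print(universal_rag_answer(
        "Who proposed retrieval-augmented generation?",
        corpora,
        generate=lambda q, ev: f"Answer to '{q}' grounded in {len(ev)} items.",
    ))
```

The key design point the sketch mirrors is that retrieval never happens over one unified multimodal index; the router first commits to a single modality-specific (and granularity-specific) corpus, which is how the paper avoids the modality gap where queries would otherwise retrieve items of their own modality.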