UniversalRAG:跨多模态與多粒度語料庫的檢索增強生成
UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities
April 29, 2025
Authors: Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, Sung Ju Hwang
cs.AI
Abstract
Retrieval-Augmented Generation (RAG) has shown substantial promise in
improving factual accuracy by grounding model responses with external knowledge
relevant to queries. However, most existing RAG approaches are limited to a
text-only corpus, and while recent efforts have extended RAG to other
modalities such as images and videos, they typically operate over a single
modality-specific corpus. In contrast, real-world queries vary widely in the
type of knowledge they require, which a single type of knowledge source cannot
address. To address this, we introduce UniversalRAG, a novel RAG framework
designed to retrieve and integrate knowledge from heterogeneous sources with
diverse modalities and granularities. Specifically, motivated by the
observation that forcing all modalities into a unified representation space
derived from a single combined corpus causes a modality gap, where the
retrieval tends to favor items from the same modality as the query, we propose
a modality-aware routing mechanism that dynamically identifies the most
appropriate modality-specific corpus and performs targeted retrieval within it.
Also, beyond modality, we organize each modality into multiple granularity
levels, enabling fine-tuned retrieval tailored to the complexity and scope of
the query. We validate UniversalRAG on 8 benchmarks spanning multiple
modalities, showing its superiority over modality-specific and unified
baselines.
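The abstract describes two mechanisms: a modality-aware router that sends each query to one modality-specific corpus (avoiding the modality gap of a single merged embedding space), and granularity levels within each modality chosen by query complexity. The sketch below illustrates that control flow only; the keyword router, toy corpora, and word-overlap scoring are hypothetical stand-ins for the paper's learned router and dense retrievers, not its implementation.

```python
# Minimal sketch of modality-aware routing with granularity levels
# (illustrative only; names and heuristics here are assumptions, not
# the paper's code).

# One corpus per (modality, granularity) pair. A real system would hold
# embeddings of paragraphs/documents, images, and video clips/full videos.
CORPORA = {
    ("text", "paragraph"): [
        "Paris is the capital of France.",
        "The Nile is the longest river in Africa.",
    ],
    ("text", "document"): ["<long multi-paragraph document>"],
    ("image", "image"): ["<image: Eiffel Tower photo>"],
    ("video", "clip"): ["<clip: dog catching a frisbee>"],
    ("video", "full"): ["<full video: cooking tutorial>"],
}


def route(query: str) -> tuple[str, str]:
    """Pick the corpus to search. A keyword/length heuristic stands in
    for the paper's learned modality-aware router."""
    q = query.lower()
    complex_query = len(q.split()) > 8  # crude proxy for complexity/scope
    if any(w in q for w in ("video", "clip", "scene")):
        return "video", "full" if complex_query else "clip"
    if any(w in q for w in ("image", "photo", "diagram")):
        return "image", "image"
    return "text", "document" if complex_query else "paragraph"


def retrieve(query: str, k: int = 1) -> list[str]:
    """Targeted retrieval inside the routed corpus only, so items never
    compete across modalities in one shared space. Word overlap stands
    in for dense similarity search."""
    corpus = CORPORA[route(query)]
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda item: len(q_words & set(item.lower().split())),
        reverse=True,
    )
    return ranked[:k]
```

For example, `route("What is the capital of France?")` yields `("text", "paragraph")`, so retrieval is confined to the paragraph-level text corpus; a query mentioning a clip would instead be searched only against the video-clip corpus.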