

Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

October 20, 2025
Authors: Chenghao Zhang, Guanting Dong, Xinyu Yang, Zhicheng Dou
cs.AI

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems focus primarily on unimodal text documents and often fall short in real-world scenarios where both queries and documents may contain mixed modalities, such as text and images. In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG): retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated generation and filtering pipeline that leverages web documents to construct NyxQA, a dataset of diverse mixed-modal question-answer pairs that better reflects real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first pre-train on NyxQA together with a variety of open-source retrieval datasets, then apply supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.
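
To make the retrieval setting concrete: a mixed-modal to mixed-modal retriever scores documents that may each combine text and images against queries of the same form. The sketch below is not Nyx itself, whose architecture and training the abstract does not detail; it is a minimal stand-in that assumes a generic CLIP-style dual encoder (openai/clip-vit-base-patch32 from Hugging Face transformers), a simple mean-pooled fusion of text and image embeddings, and placeholder file names, purely to illustrate the interface such a retriever exposes.

# A minimal sketch of mixed-modal -> mixed-modal retrieval scoring.
# Assumptions: Nyx's actual encoder and fusion are NOT shown here; we
# stand in a CLIP-style dual encoder and mean-pool the text and image
# embeddings of each item into one vector. File names are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_mixed(text=None, image=None):
    """Embed an item that may contain text, an image, or both."""
    parts = []
    if text is not None:
        inputs = processor(text=[text], return_tensors="pt", truncation=True)
        parts.append(model.get_text_features(**inputs))
    if image is not None:
        inputs = processor(images=image, return_tensors="pt")
        parts.append(model.get_image_features(**inputs))
    vec = torch.cat(parts).mean(dim=0)  # naive modality fusion (assumption)
    return vec / vec.norm()             # unit-normalize for cosine similarity

# Rank mixed-modal documents against a mixed-modal query by cosine similarity.
query = embed_mixed(text="What plant is this and how do I care for it?",
                    image=Image.open("query_photo.jpg"))
docs = [
    embed_mixed(text="Care guide for Monstera deliciosa.",
                image=Image.open("monstera.jpg")),   # text + image document
    embed_mixed(text="A text-only article about succulents."),
]
scores = torch.stack([query @ d for d in docs])
print(scores.argsort(descending=True))  # indices of best-matching docs first

In a full URAG system, the top-ranked mixed-modal documents would then be passed to a downstream VLM as generation context; the paper's second training stage tunes the retriever with feedback from that VLM rather than with similarity labels alone.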