Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation
October 20, 2025
Authors: Chenghao Zhang, Guanting Dong, Xinyu Yang, Zhicheng Dou
cs.AI
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for
enhancing large language models (LLMs) by retrieving relevant documents from an
external corpus. However, existing RAG systems primarily focus on unimodal text
documents, and often fall short in real-world scenarios where both queries and
documents may contain mixed modalities (such as text and images). In this
paper, we address the challenge of Universal Retrieval-Augmented Generation
(URAG), which involves retrieving and reasoning over mixed-modal information to
improve vision-language generation. To this end, we propose Nyx, a unified
mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate
the scarcity of realistic mixed-modal data, we introduce a four-stage automated
pipeline for generation and filtering, leveraging web documents to construct
NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that
better reflect real-world information needs. Building on this high-quality
dataset, we adopt a two-stage training framework for Nyx: we first perform
pre-training on NyxQA along with a variety of open-source retrieval datasets,
followed by supervised fine-tuning using feedback from downstream
vision-language models (VLMs) to align retrieval outputs with generative
preferences. Experimental results demonstrate that Nyx not only performs
competitively on standard text-only RAG benchmarks, but also excels in the more
general and realistic URAG setting, significantly improving generation quality
in vision-language tasks.
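The abstract does not specify Nyx's architecture, but a mixed-modal-to-mixed-modal retriever of this kind is typically built on dense retrieval: queries and documents (each possibly combining text and image content) are encoded into a shared embedding space, and candidates are ranked by cosine similarity. The sketch below is a hypothetical illustration of that retrieval step only; the encoder is replaced by toy fixed vectors, whereas the actual system would use a trained vision-language encoder, and the fusion-by-averaging of text and image embeddings is an assumption for illustration.

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=2):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:k]
    return order.tolist(), scores[order].tolist()

# Toy stand-ins for encoder outputs (a real system would produce these
# with a vision-language model; dimensions here are 2 for readability).
text_emb = np.array([1.0, 0.0])    # embedding of the query's text part
image_emb = np.array([0.8, 0.2])   # embedding of the query's image part
# Hypothetical fusion: average the modality embeddings into one query vector.
query_emb = (text_emb + image_emb) / 2

# Embeddings of three mixed-modal documents in the same space.
doc_embs = np.array([
    [1.0, 0.0],   # doc 0: mostly matches the text part
    [0.0, 1.0],   # doc 1: unrelated
    [0.7, 0.7],   # doc 2: partial match to both parts
])

ids, scores = cosine_top_k(query_emb, doc_embs, k=2)
print(ids)  # document indices ranked by similarity
```

In this toy setup, document 0 ranks first and document 2 second, since both align with the fused query vector while document 1 is nearly orthogonal to it.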