Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation
October 20, 2025
Authors: Chenghao Zhang, Guanting Dong, Xinyu Yang, Zhicheng Dou
cs.AI
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for
enhancing large language models (LLMs) by retrieving relevant documents from an
external corpus. However, existing RAG systems focus primarily on unimodal text
documents and often fall short in real-world scenarios where both queries and
documents may contain mixed modalities, such as text and images. In this
paper, we address the challenge of Universal Retrieval-Augmented Generation
(URAG), which involves retrieving and reasoning over mixed-modal information to
improve vision-language generation. To this end, we propose Nyx, a unified
mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate
the scarcity of realistic mixed-modal data, we introduce a four-stage automated
pipeline for generation and filtering, leveraging web documents to construct
NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that
better reflect real-world information needs. Building on this high-quality
dataset, we adopt a two-stage training framework for Nyx: we first perform
pre-training on NyxQA along with a variety of open-source retrieval datasets,
followed by supervised fine-tuning using feedback from downstream
vision-language models (VLMs) to align retrieval outputs with generative
preferences. Experimental results demonstrate that Nyx not only performs
competitively on standard text-only RAG benchmarks, but also excels in the more
general and realistic URAG setting, significantly improving generation quality
in vision-language tasks.
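
To make the retrieval interface described above concrete, here is a minimal sketch of a mixed-modal-to-mixed-modal retriever: both queries and documents are interleaved text and images, each is pooled into a single vector in a shared embedding space, and retrieval ranks by cosine similarity. This is not the authors' implementation; the `MixedModalItem` type, the hash-based toy encoder, and the mean-pooling scheme are placeholder assumptions, where Nyx would instead use a trained vision-language backbone.

```python
# Illustrative sketch only, not the Nyx codebase. A real system would
# replace _toy_embed with a trained vision-language encoder.
from dataclasses import dataclass, field
from typing import List, Tuple
import hashlib

import numpy as np

DIM = 64  # placeholder embedding dimensionality


@dataclass
class MixedModalItem:
    """A query or document composed of interleaved text and image references."""
    text: str = ""
    image_ids: List[str] = field(default_factory=list)


def _toy_embed(token: str) -> np.ndarray:
    # Deterministic pseudo-embedding derived from a hash, standing in for
    # a learned encoder so the sketch is runnable end to end.
    seed = int.from_bytes(hashlib.sha256(token.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.standard_normal(DIM)


def embed_mixed(item: MixedModalItem) -> np.ndarray:
    # Pool token and image embeddings into one L2-normalized vector, so
    # queries and documents share one space regardless of modality mix.
    parts = [_toy_embed(tok) for tok in item.text.split()]
    parts += [_toy_embed(f"img:{img}") for img in item.image_ids]
    vec = np.mean(parts, axis=0) if parts else np.zeros(DIM)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def retrieve(query: MixedModalItem,
             corpus: List[MixedModalItem],
             k: int = 3) -> List[Tuple[int, float]]:
    # Rank documents by cosine similarity (dot product of unit vectors).
    q = embed_mixed(query)
    scores = [float(q @ embed_mixed(doc)) for doc in corpus]
    ranked = sorted(range(len(corpus)), key=lambda i: -scores[i])
    return [(i, scores[i]) for i in ranked[:k]]


if __name__ == "__main__":
    docs = [
        MixedModalItem("diagram of a transformer encoder", ["fig1"]),
        MixedModalItem("recipe for sourdough bread"),
        MixedModalItem("attention heads visualized", ["fig2", "fig3"]),
    ]
    for idx, score in retrieve(MixedModalItem("transformer attention"), docs):
        print(idx, round(score, 3), docs[idx].text)
```

In this framing, the abstract's two-stage training would fit on top of the same interface: contrastive pre-training shapes the shared embedding space, and VLM-feedback fine-tuning adjusts which retrieved items score highly for generation.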