Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

Abstract

Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of ⟨query, evidence chunk, answer⟩ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under its winning configuration selected during retrieval, yielding an interleaved evidence form where the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, with the chunks supplied to the generator interleaving multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.
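
To make the chunk-level selection described above concrete, the following is a minimal sketch, not the CARVE implementation. It assumes a configuration is a (modality, granularity) pair, that each retriever maps a query to scored chunk IDs, and that a reranker scores a chunk under a given configuration. The names `Config`, `Retriever`, `Reranker`, and `select_evidence` are hypothetical, and the retrievers are called one after another here for simplicity rather than in parallel.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types, introduced only for this illustration.
Config = Tuple[str, str]                               # (modality, granularity)
Retriever = Callable[[str], List[Tuple[str, float]]]   # query -> [(chunk_id, score)]
Reranker = Callable[[str, str, Config], float]         # (query, chunk_id, config) -> score


def select_evidence(
    query: str,
    retrievers: Dict[Config, Retriever],
    rerank: Reranker,
    top_k: int,
) -> List[Tuple[str, Config]]:
    """Return up to top_k chunks, each paired with its own winning configuration."""
    best: Dict[str, Tuple[Config, float]] = {}
    for config, retrieve in retrievers.items():
        # Assumption: the retriever's own score is used only to propose
        # candidates; the reranker score decides the winning configuration.
        for chunk_id, _retrieval_score in retrieve(query):
            score = rerank(query, chunk_id, config)
            if chunk_id not in best or score > best[chunk_id][1]:
                best[chunk_id] = (config, score)
    # Rank chunks by the score of their winning configuration.
    ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)
    return [(chunk_id, config) for chunk_id, (config, _score) in ranked[:top_k]]
```

Because the configuration is chosen per chunk rather than per query, the returned list can mix configurations, which is the interleaved evidence form the abstract refers to. A generator would then receive each chunk rendered under its paired configuration.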