Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

Abstract

Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of ⟨query, evidence chunk, answer⟩ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under the winning configuration selected during retrieval, yielding an interleaved evidence form in which the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, and the chunks it supplies to the generator interleave multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.
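The chunk-level selection described above can be illustrated with a minimal sketch. It is not the CARVE implementation: the abstract does not specify the reranker, so this sketch assumes that each configuration's retriever has already produced a relevance score per chunk, that scores are comparable across configurations, and that the winning configuration is simply the highest-scoring one. The function name `select_winning_configs`, the configuration labels, and the toy scores are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical types: a configuration is a (modality, granularity) pair,
# and each retriever yields a relevance score per chunk id.
Config = Tuple[str, str]
Scores = Dict[str, float]


def select_winning_configs(
    per_config_scores: Dict[Config, Scores], top_k: int
) -> List[Tuple[str, Config, float]]:
    """Pick, for every chunk, the configuration with the highest score,
    then keep the top_k chunks ranked by that winning score.

    Assumption: scores from different configurations are directly
    comparable (e.g. already calibrated by a shared reranker).
    """
    best: Dict[str, Tuple[Config, float]] = {}
    for config, scores in per_config_scores.items():
        for chunk_id, score in scores.items():
            if chunk_id not in best or score > best[chunk_id][1]:
                best[chunk_id] = (config, score)
    ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)
    return [(chunk_id, cfg, score) for chunk_id, (cfg, score) in ranked[:top_k]]


# Toy scores from two hypothetical retrievers run in parallel.
toy_scores = {
    ("frames", "10s"): {"c1": 0.9, "c2": 0.2, "c3": 0.4},
    ("transcript", "30s"): {"c1": 0.3, "c2": 0.8, "c3": 0.5},
}

# Each selected chunk carries its own winning configuration to the generator.
evidence = select_winning_configs(toy_scores, top_k=2)
for chunk_id, config, score in evidence:
    print(chunk_id, config, score)
```

In this toy case the two selected chunks arrive under different configurations, which is the interleaved evidence form the abstract contrasts with query-level methods that fix one configuration for all chunks.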