Rethinking RAG for Long Videos: What to Retrieve and How to Use It?

Abstract

Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps. First, existing benchmarks contain queries that can be answered without the video, which obscures retrieval errors. Second, prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of ⟨query, evidence chunk, answer⟩ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under the winning configuration selected for it during retrieval, yielding an interleaved evidence form in which the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, and the chunks it supplies to the generator interleave multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.
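
As a rough illustration of the chunk-level selection described in the abstract, the following minimal Python sketch keeps, for each chunk, the configuration under which it scored highest, then ranks chunks by that winning score. It is not the authors' implementation: the `Hit` record, the `chunk_adaptive_rerank` function, the configuration labels, and the scores are all hypothetical, and the sketch assumes that scores from different configurations are directly comparable (for example, because one shared reranker produced them), which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One retrieval result: a chunk scored under one configuration."""
    chunk_id: str
    config: str   # hypothetical modality-granularity label, e.g. "frames@10s"
    score: float  # relevance score, assumed comparable across configurations


def chunk_adaptive_rerank(hits, top_k):
    """Keep each chunk's best-scoring configuration, then rank the chunks."""
    best = {}
    for hit in hits:
        current = best.get(hit.chunk_id)
        if current is None or hit.score > current.score:
            best[hit.chunk_id] = hit  # this configuration wins for the chunk so far
    ranked = sorted(best.values(), key=lambda h: h.score, reverse=True)
    return ranked[:top_k]


# Pooled output of two hypothetical parallel retrievers (made-up scores).
hits = [
    Hit("c1", "frames@10s", 0.62),
    Hit("c1", "transcript@60s", 0.81),
    Hit("c2", "frames@10s", 0.74),
    Hit("c2", "transcript@60s", 0.40),
    Hit("c3", "frames@10s", 0.15),
]

evidence = chunk_adaptive_rerank(hits, top_k=2)
for hit in evidence:
    print(hit.chunk_id, hit.config, hit.score)
```

In this toy input the two selected chunks win under different configurations, so the evidence passed to a generator would be interleaved rather than tied to one query-level setting.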