Rethinking RAG for Long Videos: What to Retrieve and How to Use It?

Abstract

Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps. First, existing benchmarks allow queries to be answered without the video, which obscures retrieval errors. Second, prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both gaps. We introduce V-RAGBench, a benchmark of ⟨query, evidence chunk, answer⟩ triplets that enables faithful, decoupled evaluation of retrieval and generation. We also propose CARVE, a simple method that runs parallel retrievers across configurations and uses chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under the winning configuration selected during retrieval, yielding an interleaved evidence form in which the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, and the chunks it supplies to the generator interleave multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.
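
The chunk-level selection described above can be illustrated with a small sketch. This is not the CARVE implementation, only a minimal example of the idea under stated assumptions: that every parallel retriever reports candidates keyed by a shared chunk identifier, and that a reranker has already assigned scores that are comparable across configurations. The `Candidate` type, the `select_interleaved` function, and the configuration labels such as `frames@30s` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    chunk_id: str  # identifies a video chunk (assumed shared across retrievers)
    config: str    # hypothetical modality-granularity label
    score: float   # assumed reranker score, comparable across configurations


def select_interleaved(candidates: list[Candidate], top_k: int) -> list[Candidate]:
    """Keep the best-scoring configuration per chunk, then the top_k chunks."""
    best: dict[str, Candidate] = {}
    for cand in candidates:
        current = best.get(cand.chunk_id)
        if current is None or cand.score > current.score:
            best[cand.chunk_id] = cand
    # Rank chunks by the score of their winning configuration.
    ranked = sorted(best.values(), key=lambda c: c.score, reverse=True)
    return ranked[:top_k]


# Pooled output of two hypothetical parallel retrievers (made-up scores).
candidates = [
    Candidate("c1", "transcript@30s", 0.62),
    Candidate("c1", "frames@30s", 0.81),
    Candidate("c2", "transcript@30s", 0.77),
    Candidate("c2", "frames@30s", 0.40),
    Candidate("c3", "frames@30s", 0.15),
]
evidence = select_interleaved(candidates, top_k=2)
```

In this toy input, chunk `c1` wins under the frame-based configuration and chunk `c2` under the transcript-based one, so the evidence passed on mixes configurations. A query-level method would instead fix one configuration for every chunk.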