LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Abstract

Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory: once the active window accumulates appearance errors, subsequent generation can condition only on this degraded trajectory and drifts further. We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context rather than only on the recent window. To make retrieval more discriminative, we introduce a Window Temporal Delta Loss, which suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes. Together, these components help reduce the error accumulation caused by sliding-window attention. Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as content-addressable retrieval memory. Code is available at https://github.com/qixinhu11/LongLive-RAG.
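
The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `LatentMemory` class, the use of cosine similarity, the `top_k` and `recent_window` parameters, and the choice to exclude the most recent blocks from retrieval are all assumptions made for illustration. Embeddings are plain Python lists, and the stored latents are placeholder strings.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


@dataclass
class LatentMemory:
    """Hypothetical content-addressable store of previously generated block latents."""
    keys: list = field(default_factory=list)     # one embedding per generated block
    latents: list = field(default_factory=list)  # the latent stored for each block

    def add(self, key, latent):
        """Append a newly generated block and its embedding to the history."""
        self.keys.append(key)
        self.latents.append(latent)

    def retrieve(self, query, top_k=2, recent_window=2):
        """Return (block_index, latent) pairs for the top_k most similar blocks.

        Assumption: blocks inside the most recent `recent_window` are skipped,
        since the sliding window already covers them.
        """
        cutoff = max(len(self.keys) - recent_window, 0)
        scored = [(cosine(query, self.keys[i]), i) for i in range(cutoff)]
        # Highest similarity first; ties are broken in favour of older blocks.
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [(i, self.latents[i]) for _, i in scored[:top_k]]


# Toy history of five blocks with 2-D embeddings (illustrative values only).
memory = LatentMemory()
toy_keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
for i, key in enumerate(toy_keys):
    memory.add(key, f"latent_{i}")

# The query resembles the early blocks, which lie outside the recent window.
hits = memory.retrieve([1.0, 0.0], top_k=2, recent_window=2)
print([i for i, _ in hits])  # [0, 1]
```

In this toy run the query matches the two earliest blocks, so they are returned even though they fell out of the sliding window long ago. In a real system the retrieved latents would be supplied to the generator as additional context alongside the recent window; how that conditioning is done is not specified in the abstract and is left out here.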