LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Abstract

Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory: once the active window accumulates appearance errors, subsequent generations can only condition on this degraded trajectory and drift further away. We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context instead of only the recent window. To make retrieval more discriminative, we introduce the Window Temporal Delta Loss that suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes. Together, these components help reduce error accumulation caused by sliding-window attention. Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as content-addressable retrieval memory. Code is available at https://github.com/qixinhu11/LongLive-RAG.
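
The retrieval step described above can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the memory layout, or how many blocks are retrieved, so everything here is an assumption for illustration: the `LatentHistory` class, the `retrieve` method, the `k` and `recent_window` parameters, and the use of cosine similarity are all hypothetical and are not taken from the LongLive-RAG code.

```python
import math
from typing import Any, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class LatentHistory:
    """Hypothetical content-addressable store: one (embedding, latent) pair per block."""

    def __init__(self) -> None:
        self.embeddings: List[List[float]] = []
        self.latents: List[Any] = []

    def append(self, embedding: Sequence[float], latent: Any) -> None:
        """Record a newly generated block so later blocks can retrieve it."""
        self.embeddings.append(list(embedding))
        self.latents.append(latent)

    def retrieve(
        self, query: Sequence[float], k: int = 2, recent_window: int = 1
    ) -> List[Tuple[int, float]]:
        """Return (block_index, score) for the top-k blocks by similarity.

        Assumption: the last `recent_window` blocks are skipped because the
        sliding-window attention already sees them.
        """
        cutoff = max(0, len(self.embeddings) - recent_window)
        scored = [(i, cosine(query, self.embeddings[i])) for i in range(cutoff)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Toy usage with 2-D embeddings; real embeddings would come from a learned encoder.
history = LatentHistory()
history.append([1.0, 0.0], "latent_block_0")
history.append([0.0, 1.0], "latent_block_1")
history.append([0.9, 0.1], "latent_block_2")
history.append([0.1, 0.9], "latent_block_3")  # falls inside the recent window

hits = history.retrieve([1.0, 0.0], k=2, recent_window=1)
retrieved_latents = [history.latents[i] for i, _ in hits]
```

In this toy setup the query retrieves blocks 0 and 2, the two non-recent blocks whose embeddings point in nearly the same direction as the query. A practical system would presumably use a vectorized or approximate nearest-neighbor search rather than this linear scan.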