LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Abstract

Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This makes the generation trajectory irreversible: once the active window accumulates appearance errors, subsequent blocks can only condition on this degraded trajectory and drift further. We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context instead of only the recent window. To make retrieval more discriminative, we introduce the Window Temporal Delta Loss, which suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes. Together, these components help reduce the error accumulation caused by sliding-window attention. Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as content-addressable retrieval memory. Code is available at https://github.com/qixinhu11/LongLive-RAG.
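
To make the retrieval step concrete, the following is a minimal sketch of content-addressable lookup over a history of generated blocks. It is an illustration under stated assumptions, not the released implementation: the `LatentMemory` class, the `retrieve` method, the cosine-similarity scoring, and the `top_k` and `recent_window` parameters are all hypothetical names and choices introduced here, and the embeddings are toy two-dimensional vectors rather than learned query embeddings.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


@dataclass
class LatentMemory:
    """Hypothetical store of (embedding, latent) pairs, one per generated block."""
    keys: list = field(default_factory=list)
    latents: list = field(default_factory=list)

    def add(self, embedding, latent):
        self.keys.append(list(embedding))
        self.latents.append(latent)

    def retrieve(self, query, top_k=2, recent_window=1):
        """Return indices of the top_k most similar blocks outside the recent window."""
        # Assumption of this sketch: blocks inside the sliding window are
        # already visible to attention, so only older blocks are candidates.
        n_candidates = max(len(self.keys) - recent_window, 0)
        scored = [(cosine(query, self.keys[i]), i) for i in range(n_candidates)]
        # Highest similarity first; ties are broken by the older block index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [i for _, i in scored[:top_k]]


memory = LatentMemory()
memory.add([1.0, 0.0], "latent_0")
memory.add([0.0, 1.0], "latent_1")
memory.add([0.9, 0.1], "latent_2")
memory.add([0.1, 0.9], "latent_3")  # most recent block, inside the window

# Query with the (toy) embedding of the block about to be generated.
retrieved = memory.retrieve([1.0, 0.0], top_k=2, recent_window=1)
context = [memory.latents[i] for i in retrieved]
print(retrieved, context)
```

In this toy run the query matches blocks 0 and 2, which lie outside the recent window, so the generator would receive them as extra non-local context. In a real system the keys would come from a trained embedding model and the lookup would likely be batched on the accelerator; the exhaustive scan here is only meant to show the control flow.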