LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Abstract

Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This makes the generation trajectory irreversible: once the active window accumulates appearance errors, subsequent blocks can condition only on the degraded trajectory and drift further from the target. We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. When generating each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context instead of only the recent window. To make retrieval more discriminative, we introduce the Window Temporal Delta Loss, which suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes. Together, these components mitigate the error accumulation caused by sliding-window attention. Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as content-addressable retrieval memory. Code is available at https://github.com/qixinhu11/LongLive-RAG.
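The retrieval step described above can be illustrated with a minimal sketch. It is not the released LongLive-RAG code: the `LatentMemory` class, the use of cosine similarity, the top-k selection, and the `exclude_recent` parameter are all assumptions chosen for illustration, and latents are stood in for by plain strings. It only shows the general idea of storing one embedding per generated block and looking up the most similar older blocks with a query embedding.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class LatentMemory:
    """Hypothetical store of (embedding, latent) pairs, one per generated block."""

    def __init__(self) -> None:
        self.keys: List[List[float]] = []
        self.latents: List[object] = []

    def add(self, key: Sequence[float], latent: object) -> None:
        """Append the embedding and latent of a newly generated block."""
        self.keys.append(list(key))
        self.latents.append(latent)

    def retrieve(
        self, query: Sequence[float], k: int, exclude_recent: int = 0
    ) -> List[Tuple[int, float]]:
        """Return up to k (block_index, score) pairs, highest score first.

        The last `exclude_recent` blocks are skipped on the assumption that
        the sliding window already attends to them.
        """
        limit = max(len(self.keys) - exclude_recent, 0)
        scored = [(i, cosine(query, self.keys[i])) for i in range(limit)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Toy history of four blocks with 2-D embeddings (illustrative values only).
memory = LatentMemory()
memory.add([1.0, 0.0], "latent_0")
memory.add([0.0, 1.0], "latent_1")
memory.add([0.9, 0.1], "latent_2")
memory.add([0.7, 0.7], "latent_3")  # most recent block, inside the window

# Query for the next block: fetch the two most similar older blocks.
hits = memory.retrieve([1.0, 0.0], k=2, exclude_recent=1)
retrieved = [memory.latents[i] for i, _ in hits]
```

In this toy history the query matches blocks 0 and 2, so their latents would be passed to the generator as extra non-local context alongside the recent window. A linear scan is used here for clarity; an approximate nearest-neighbor index could replace it for very long histories.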