Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
September 29, 2025
Authors: Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, Shijian Lu
cs.AI
Abstract
Streaming video generation, as one fundamental component in interactive world
models and neural game engines, aims to generate high-quality, low-latency, and
temporally coherent long video streams. However, most existing work suffers
from severe error accumulation, which often significantly degrades the
generated video streams over long horizons. We design Rolling Forcing, a novel video
generation technique that enables streaming long videos with minimal error
accumulation. Rolling Forcing comes with three novel designs. First, instead of
iteratively sampling individual frames, which accelerates error propagation, we
design a joint denoising scheme that simultaneously denoises multiple frames
with progressively increasing noise levels. This design relaxes the strict
causality across adjacent frames, effectively suppressing error growth. Second,
we introduce the attention sink mechanism into the long-horizon streaming
video generation task, which allows the model to keep the key-value states of initial
frames as a global context anchor and thereby enhances long-term global
consistency. Third, we design an efficient training algorithm that enables
few-step distillation over largely extended denoising windows. This algorithm
operates on non-overlapping windows and mitigates exposure bias by
conditioning on self-generated histories. Extensive experiments show that Rolling Forcing
enables real-time streaming generation of multi-minute videos on a single GPU,
with substantially reduced error accumulation.
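The first design, joint denoising over a window of frames with progressively increasing noise levels, can be illustrated with a minimal scheduling sketch. This is not the paper's implementation; the window size, linear noise schedule, toy latents, and the `denoise_fn` placeholder are all illustrative assumptions. The point is the rolling structure: every step jointly denoises the whole window by one level, emits the now-clean oldest frame, and admits a fresh pure-noise frame at the newest slot.

```python
import numpy as np

def rolling_denoise(num_frames, window=4, denoise_fn=None):
    """Sketch of a rolling joint-denoising schedule (hypothetical API).

    Window slots carry progressively increasing noise levels: the oldest
    slot is nearly clean, the newest holds pure noise. Each iteration
    jointly denoises all slots by one level, emits the oldest frame,
    and appends a fresh pure-noise frame.
    """
    # Noise levels assigned to window slots, oldest -> newest,
    # e.g. [0.25, 0.5, 0.75, 1.0] for window=4 (illustrative schedule).
    levels = np.linspace(1.0 / window, 1.0, window)
    if denoise_fn is None:
        # Dummy denoiser standing in for the video diffusion model:
        # it scales the latent toward zero by one noise level per call.
        denoise_fn = lambda x, lvl: x * max(lvl - 1.0 / window, 0.0) / max(lvl, 1e-8)
    window_frames = [np.random.randn(8, 8) for _ in range(window)]  # toy latents
    emitted = []
    for _ in range(num_frames):
        # One joint step: every frame in the window moves down one level.
        window_frames = [denoise_fn(x, lvl) for x, lvl in zip(window_frames, levels)]
        # The oldest slot has reached noise level 0: stream it out.
        emitted.append(window_frames.pop(0))
        # A fresh pure-noise frame enters at the newest slot.
        window_frames.append(np.random.randn(8, 8))
    return emitted
```

Because every frame is denoised jointly with its neighbors rather than sampled to completion one at a time, no frame is ever conditioned on a single fully committed (and possibly erroneous) predecessor, which is how the scheme relaxes strict frame-to-frame causality.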
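The attention-sink idea in the second design can likewise be sketched as a cache policy. The class below is a hypothetical interface, not the paper's code: it pins the key-value states of the first few frames permanently while a bounded rolling window holds the most recent ones, so attention always sees the initial frames as a global anchor.

```python
from collections import deque

class SinkKVCache:
    """Sketch of an attention-sink KV cache (hypothetical interface).

    The first `sink_size` frames' key-value states are pinned forever as
    a global context anchor; beyond that, only the latest `window_size`
    frames are kept, with older non-sink entries evicted automatically.
    """
    def __init__(self, sink_size=1, window_size=4):
        self.sink_size = sink_size
        self.sink = []                           # pinned KV of initial frames
        self.recent = deque(maxlen=window_size)  # rolling local window

    def append(self, kv):
        if len(self.sink) < self.sink_size:
            self.sink.append(kv)    # pin the earliest frames permanently
        else:
            self.recent.append(kv)  # deque drops the oldest when full

    def context(self):
        # Attention attends over the sink anchors plus the recent window.
        return self.sink + list(self.recent)
```

For example, after appending KV entries for frames 0 through 9 with `sink_size=1` and `window_size=3`, the context is frame 0 plus frames 7, 8, and 9: the anchor survives indefinitely while the local window rolls forward, which is what sustains long-term global consistency at constant memory cost.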