Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Abstract

Streaming video generation, a fundamental component of interactive world models and neural game engines, aims to produce high-quality, low-latency, and temporally coherent long video streams. However, most existing work suffers from severe error accumulation, which often significantly degrades the generated video streams over long horizons. We design Rolling Forcing, a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme that simultaneously denoises multiple frames with progressively increasing noise levels. This design relaxes the strict causality across adjacent frames, effectively suppressing error growth. Second, we introduce the attention sink mechanism into the long-horizon streaming video generation task, which allows the model to keep the key-value states of the initial frames as a global context anchor and thereby enhances long-term global consistency. Third, we design an efficient training algorithm that enables few-step distillation over substantially extended denoising windows. This algorithm operates on non-overlapping windows and mitigates exposure bias by conditioning on self-generated histories. Extensive experiments show that Rolling Forcing enables real-time streaming generation of multi-minute videos on a single GPU, with substantially reduced error accumulation.
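The first two designs can be illustrated with a toy sketch that tracks only frame indices and noise levels and involves no actual diffusion model. It is an illustration under stated assumptions, not the authors' implementation: the function names, the window size, the rule that exactly one frame enters and leaves the window per step, and the sink and recent cache sizes are all hypothetical choices that the abstract does not specify.

```python
from collections import deque


def rolling_schedule(num_frames, window):
    """Simulate a rolling denoising window (hypothetical toy model).

    Noise level `window` means pure noise and 0 means clean.
    Returns the order in which frames are emitted and a per-step
    trace of (frame_index, noise_level) pairs after denoising.
    """
    active = deque()      # frames currently in the window, oldest first
    next_frame = 0
    emitted, trace = [], []
    while len(emitted) < num_frames:
        # Assumption: one pure-noise frame enters at the back per step.
        if next_frame < num_frames:
            active.append([next_frame, window])
            next_frame += 1
        # Joint denoising: every frame in the window drops one level.
        for item in active:
            item[1] -= 1
        trace.append([tuple(item) for item in active])
        # The oldest frame leaves the window once it is clean.
        if active[0][1] == 0:
            emitted.append(active.popleft()[0])
    return emitted, trace


def kv_frames_to_keep(num_generated, sink, recent):
    """Indices of past frames whose key-value states stay cached.

    Hypothetical attention-sink policy: always keep the first `sink`
    frames as a global anchor, plus the `recent` most recent frames.
    """
    sink_ids = list(range(min(sink, num_generated)))
    recent_start = max(sink, num_generated - recent)
    return sink_ids + list(range(recent_start, num_generated))


if __name__ == "__main__":
    emitted, trace = rolling_schedule(num_frames=5, window=3)
    print(emitted)    # frames come out in order
    print(trace[2])   # a full window: noise level rises from front to back
    print(kv_frames_to_keep(num_generated=10, sink=2, recent=3))
```

In this toy schedule, once the window is full, the frames it holds sit at strictly increasing noise levels from oldest to newest, which is the sense in which adjacent frames are denoised jointly rather than one fully after another.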
