Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
February 8, 2026
Authors: Haodong Li, Shaoteng Liu, Zhe Lin, Manmohan Chandraker
cs.AI
Abstract
Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, because they are trained on clips of limited duration, a train-test gap emerges when they are tested at longer horizons, leading to rapid visual degradation. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons seen during training and the open-ended horizons encountered during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To this end, we conduct a systematic analysis of AR cache maintenance, and the resulting insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: https://rolling-sink.github.io/
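To give a sense of the scale of this gap, the following minimal sketch works out the frame counts implied by the figures quoted in the abstract (5s training clips, 5-30 minute test videos, 16 FPS). The helper names `frame_count` and `horizon_ratio` are hypothetical and used only for illustration. The sketch counts output video frames at the nominal frame rate; it assumes nothing about the model's internal latent frames or cache layout, which the abstract does not specify.

```python
FPS = 16            # frame rate quoted in the abstract
TRAIN_SECONDS = 5   # training clip length quoted in the abstract


def frame_count(seconds: float, fps: int = FPS) -> int:
    """Number of output frames in a clip of the given duration (hypothetical helper)."""
    return int(round(seconds * fps))


def horizon_ratio(test_seconds: float, train_seconds: float = TRAIN_SECONDS) -> float:
    """How many times longer the test horizon is than the training horizon."""
    return test_seconds / train_seconds


if __name__ == "__main__":
    print(f"training clip: {TRAIN_SECONDS} s, {frame_count(TRAIN_SECONDS)} frames")
    for minutes in (5, 30):
        test_seconds = minutes * 60
        print(
            f"test video: {minutes} min, {frame_count(test_seconds)} frames, "
            f"{horizon_ratio(test_seconds):.0f}x the training horizon"
        )
```

Under these assumptions, a 5s training clip is 80 frames, while the 5-30 minute test videos span 4,800 to 28,800 frames, i.e., 60 to 360 times the training horizon.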