LIVE: Long-horizon Interactive Video World Modeling
February 3, 2026
Authors: Junchao Huang, Ziyang Ye, Xinting Hu, Tianyu He, Guiyu Zhang, Shaoshuai Shi, Jiang Bian, Li Jiang
cs.AI
Abstract
Autoregressive video world models predict future visual observations conditioned on actions. While effective over short horizons, these models often struggle with long-horizon generation, as small prediction errors accumulate over time. Prior methods alleviate this by introducing pre-trained teacher models and sequence-level distribution matching, which incur additional computational cost and fail to prevent error propagation beyond the training horizon. In this work, we propose LIVE, a Long-horizon Interactive Video world modEl that enforces bounded error accumulation via a novel cycle-consistency objective, thereby eliminating the need for teacher-based distillation. Specifically, LIVE first performs a forward rollout from ground-truth frames and then applies a reverse generation process to reconstruct the initial state. The diffusion loss is subsequently computed on the reconstructed terminal state, providing an explicit constraint on long-horizon error propagation. Moreover, we provide a unified view that encompasses different approaches and introduce a progressive training curriculum to stabilize training. Experiments demonstrate that LIVE achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond the training rollout length.
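To make the forward-then-reverse cycle concrete, below is a minimal PyTorch sketch of a cycle-consistency training objective of the kind the abstract describes: roll forward from a ground-truth frame under an action sequence, roll back to reconstruct the initial state, and compute a diffusion loss on that reconstruction. Every name here (`WorldModel`, `step`, `denoise`, the negated-action reverse conditioning, the linear noising schedule) is a hypothetical placeholder for illustration, not the authors' actual architecture or API.

```python
import torch
import torch.nn as nn


class WorldModel(nn.Module):
    """Toy action-conditioned next-frame predictor (placeholder architecture)."""

    def __init__(self, frame_dim: int = 64, action_dim: int = 8):
        super().__init__()
        self.step_net = nn.Sequential(
            nn.Linear(frame_dim + action_dim, 256), nn.SiLU(), nn.Linear(256, frame_dim)
        )
        self.denoise_net = nn.Sequential(
            nn.Linear(frame_dim + frame_dim + 1, 256), nn.SiLU(), nn.Linear(256, frame_dim)
        )

    def step(self, frame: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # One autoregressive rollout step: predict the next frame from (frame, action).
        return self.step_net(torch.cat([frame, action], dim=-1))

    def denoise(self, noisy: torch.Tensor, t: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Simple noise-prediction head conditioned on a context frame.
        return self.denoise_net(torch.cat([noisy, cond, t], dim=-1))


def cycle_consistency_loss(model: WorldModel, x0: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Forward rollout from a ground-truth frame, reverse rollout back toward it,
    then a standard diffusion (noise-matching) loss on the reconstructed state."""
    # --- forward rollout from the ground-truth frame x0 ---
    frame = x0
    for k in range(actions.shape[1]):
        frame = model.step(frame, actions[:, k])

    # --- reverse generation: roll back under the reversed, negated action sequence ---
    # (how the reverse conditioning is formed is an assumption of this sketch)
    recon = frame
    for k in reversed(range(actions.shape[1])):
        recon = model.step(recon, -actions[:, k])

    # --- diffusion loss on the reconstructed terminal state of the cycle (the recovered x0) ---
    eps = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], 1)                  # noise level in [0, 1]
    noisy = (1.0 - t) * x0 + t * eps                # simple linear forward noising process
    pred_eps = model.denoise(noisy, t, cond=recon)  # condition the denoiser on the reconstruction
    return torch.mean((pred_eps - eps) ** 2)


if __name__ == "__main__":
    model = WorldModel()
    x0 = torch.randn(4, 64)          # batch of ground-truth starting frames (flattened)
    actions = torch.randn(4, 6, 8)   # 6-step action sequence per sample
    loss = cycle_consistency_loss(model, x0, actions)
    loss.backward()
    print(f"cycle-consistency diffusion loss: {loss.item():.4f}")
```

Because the loss is taken on the state that closes the cycle rather than on each intermediate prediction, gradient pressure falls on the accumulated rollout error itself, which is the bounded-error-accumulation idea the abstract emphasizes; the real training curriculum, conditioning scheme, and diffusion parameterization are described in the paper, not here.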