LIVE: Long-horizon Interactive Video World Modeling
February 3, 2026
Authors: Junchao Huang, Ziyang Ye, Xinting Hu, Tianyu He, Guiyu Zhang, Shaoshuai Shi, Jiang Bian, Li Jiang
cs.AI
Abstract
Autoregressive video world models predict future visual observations conditioned on actions. While effective over short horizons, these models often struggle with long-horizon generation, as small prediction errors accumulate over time. Prior methods alleviate this by introducing pre-trained teacher models and sequence-level distribution matching, which incur additional computational cost and fail to prevent error propagation beyond the training horizon. In this work, we propose LIVE, a Long-horizon Interactive Video world modEl that enforces bounded error accumulation via a novel cycle-consistency objective, thereby eliminating the need for teacher-based distillation. Specifically, LIVE first performs a forward rollout from ground-truth frames and then applies a reverse generation process to reconstruct the initial state. The diffusion loss is subsequently computed on the reconstructed terminal state, providing an explicit constraint on long-horizon error propagation. Moreover, we provide a unified view that encompasses different approaches and introduce a progressive training curriculum to stabilize training. Experiments demonstrate that LIVE achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond training rollout lengths.
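To make the cycle-consistency idea concrete, below is a minimal, self-contained PyTorch sketch of one training step under simplifying assumptions. The ToyWorldModel, its rollout method, the use of time-reversed actions for the backward pass, and the MSE loss standing in for the paper's diffusion loss on the reconstructed terminal state are all illustrative placeholders, not the authors' implementation.

import torch
import torch.nn as nn


class ToyWorldModel(nn.Module):
    # Stand-in for an action-conditioned autoregressive video world model;
    # states are flat latent vectors rather than real video frames.
    def __init__(self, frame_dim: int = 64, action_dim: int = 4):
        super().__init__()
        self.step = nn.Linear(frame_dim + action_dim, frame_dim)

    def rollout(self, state: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # Autoregressively predict one latent state per action.
        preds = []
        for t in range(actions.shape[1]):
            state = torch.tanh(self.step(torch.cat([state, actions[:, t]], dim=-1)))
            preds.append(state)
        return torch.stack(preds, dim=1)  # (batch, horizon, frame_dim)


def cycle_consistency_loss(model, gt_initial_state, actions):
    # 1) Forward rollout conditioned on the ground-truth initial state.
    forward_states = model.rollout(gt_initial_state, actions)
    # 2) Reverse generation: start from the last predicted state and replay
    #    the actions backwards to return toward the initial state.
    backward_states = model.rollout(forward_states[:, -1], torch.flip(actions, dims=[1]))
    # 3) Loss on the reconstructed terminal state of the cycle (MSE here as a
    #    stand-in for the diffusion loss used in the paper), which penalizes
    #    error accumulated over the whole rollout.
    return nn.functional.mse_loss(backward_states[:, -1], gt_initial_state)


# Usage: one optimization step over a random batch (batch=2, horizon=8).
model = ToyWorldModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
initial_state = torch.randn(2, 64)
actions = torch.randn(2, 8, 4)
loss = cycle_consistency_loss(model, initial_state, actions)
loss.backward()
optimizer.step()

The point the sketch illustrates is that the training signal is applied only after a full forward-and-backward cycle, so errors accumulated across the rollout are explicitly constrained at the reconstructed initial state rather than matched frame by frame against a teacher.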