LongVie 2: Multimodal Controllable Ultra-Long Video World Model
December 15, 2025
Authors: Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, Ziwei Liu
cs.AI
Abstract
Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability, then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) multimodal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) degradation-aware training on the input frame, which bridges the gap between training and long-term inference to maintain high visual quality; and (3) history-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.
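To make the three stages concrete, the Python sketch below shows how they could fit together around a clip-level generator. This is a minimal illustration based only on the abstract, not the authors' implementation: `model.generate_clip`, `degrade`, and all parameter names are hypothetical stand-ins, and the clip count, tensor layout, and noise model are assumptions.

```python
import torch


def degrade(frame: torch.Tensor, noise_std: float = 0.05) -> torch.Tensor:
    """Stage 2 (degradation-aware training), sketched as additive noise:
    perturb the conditioning frame during training so it resembles the
    imperfect last frames the model actually sees at long-horizon inference."""
    return (frame + noise_std * torch.randn_like(frame)).clamp(0.0, 1.0)


@torch.no_grad()
def rollout(model, dense_ctrl, sparse_ctrl, first_frame, num_clips=60):
    """Stages 1 and 3 at inference time: each clip is conditioned on dense and
    sparse control signals (multimodal guidance) plus a running history context
    that is aligned across adjacent clips (history-context guidance)."""
    clips, history, last_frame = [], None, first_frame
    for i in range(num_clips):
        clip, history = model.generate_clip(
            dense=dense_ctrl[i],    # e.g. dense per-pixel maps such as depth
            sparse=sparse_ctrl[i],  # e.g. sparse trajectories or keypoints
            init_frame=last_frame,  # last frame of the previous clip
            history=history,        # context carried across clip boundaries
        )
        clips.append(clip)          # clip assumed shaped (B, T, C, H, W)
        last_frame = clip[:, -1]    # final frame seeds the next clip
    return torch.cat(clips, dim=1)  # concatenate along the time axis
```

Under these assumptions, the autoregressive structure makes the role of each stage visible: degradation-aware training targets the `init_frame` handoff, while the `history` argument is what ties adjacent clips into a temporally consistent multi-minute rollout.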