

Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

April 10, 2026
Authors: Zile Wang, Zexiang Liu, Jaixing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, Yidan Xietian, Jiangbo Pei, Liang Hu, Boyi Jiang, Hua Xue, Zidong Wang, Haofeng Sun, Wei Li, Wanli Ouyang, Xianglong He, Yang Liu, Yangguang Li, Yahui Zhou
cs.AI

Abstract

With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time long-form video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long-horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.
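The self-correction idea described above (re-injecting imperfect generated frames into the conditioning context during training, so the model learns to recover from its own errors) can be illustrated with a minimal NumPy sketch. Everything here is hypothetical: `toy_denoiser`, the corruption schedule, and the residual computation are stand-ins for the paper's actual diffusion model and training objective, not its implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(noisy_frame, context):
    # Stand-in for the diffusion model: pulls the noisy frame
    # toward the mean of its conditioning context.
    return 0.5 * noisy_frame + 0.5 * context.mean(axis=0)

def rollout_with_reinjection(gt_frames, reinject_prob=0.5):
    """Autoregressive training rollout that, with some probability,
    conditions on the model's own imperfect outputs instead of
    ground-truth frames, exposing the model to its own errors."""
    context = [gt_frames[0]]
    residuals = []
    for t in range(1, len(gt_frames)):
        noisy = gt_frames[t] + rng.normal(0.0, 0.1, gt_frames[t].shape)
        pred = toy_denoiser(noisy, np.stack(context))
        # Prediction residual: the quantity a training loss would supervise.
        residuals.append(float(np.abs(pred - gt_frames[t]).mean()))
        # Re-inject the imperfect generated frame into the context.
        if rng.random() < reinject_prob:
            context.append(pred)
        else:
            context.append(gt_frames[t])
    return residuals

frames = np.linspace(0.0, 1.0, 8)[:, None] * np.ones((8, 4))
residuals = rollout_with_reinjection(frames)
```

Without the re-injection branch, the model only ever sees clean ground-truth contexts at training time and accumulates drift at inference; mixing in generated frames closes that train-test gap.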
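The camera-aware memory retrieval mentioned above can likewise be sketched as a pose-keyed buffer: store each frame's latent together with its camera pose, and when generating a new frame, retrieve the stored entries whose poses are nearest the current viewpoint so revisited regions stay consistent. The class and distance metric below are illustrative assumptions, not the paper's mechanism.

```python
import numpy as np

class CameraAwareMemory:
    """Toy memory bank keyed by camera pose. Stores (pose, latent)
    pairs and retrieves the k entries nearest the query pose, so
    previously seen viewpoints can be re-injected for consistency."""

    def __init__(self):
        self.poses = []    # camera positions, each shape (3,)
        self.latents = []  # frame latents (opaque payloads here)

    def write(self, pose, latent):
        self.poses.append(np.asarray(pose, dtype=float))
        self.latents.append(latent)

    def retrieve(self, query_pose, k=2):
        if not self.poses:
            return []
        dists = np.linalg.norm(
            np.stack(self.poses) - np.asarray(query_pose, dtype=float),
            axis=1,
        )
        nearest = np.argsort(dists)[:k]
        return [self.latents[i] for i in nearest]

# Usage: a camera moving along the x-axis, then returning near the start.
mem = CameraAwareMemory()
for t in range(5):
    mem.write([float(t), 0.0, 0.0], f"latent_{t}")
hits = mem.retrieve([0.2, 0.0, 0.0], k=2)
# → ["latent_0", "latent_1"], the two closest stored viewpoints
```

A real system would presumably match full 6-DoF poses (position plus orientation, e.g. view-frustum overlap) rather than raw position distance, but the retrieve-then-inject pattern is the same.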
PDF 362 · April 14, 2026