Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

April 10, 2026
Authors: Zile Wang, Zexiang Liu, Jaixing Li, Kaichen Huang, Baixin Xu, Fei Kang, Mengyin An, Peiyu Wang, Biao Jiang, Yichen Wei, Yidan Xietian, Jiangbo Pei, Liang Hu, Boyi Jiang, Hua Xue, Zidong Wang, Haofeng Sun, Wei Li, Wanli Ouyang, Xianglong He, Yang Liu, Yangguang Li, Yahui Zhou
cs.AI

Abstract

As interactive video generation advances, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to achieve memory-enabled long-term temporal consistency and high-resolution real-time generation at the same time, which limits their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for real-time, long-form video generation at 720p. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns to correct its own errors, while camera-aware memory retrieval and injection enable it to maintain long-horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves real-time generation at up to 40 FPS at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.
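
The abstract does not say how camera-aware memory retrieval is implemented. As a rough illustration of the general idea only, the Python sketch below picks the stored frames whose camera pose is closest to the current one. Everything in it is an assumption made for illustration and is not taken from the paper: the `MemoryEntry` structure, the `retrieve` and `angle_diff_deg` helpers, the cost function (Euclidean distance plus a weighted heading difference), and the `angle_weight` value.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """One stored frame and the camera pose it was generated from (hypothetical)."""
    frame_id: int      # index of the stored frame
    position: tuple    # camera position (x, y, z)
    yaw_deg: float     # camera heading in degrees


def angle_diff_deg(a, b):
    """Smallest absolute difference between two headings, in [0, 180]."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def retrieve(memory, position, yaw_deg, k=2, angle_weight=0.05):
    """Return the k entries whose camera pose is closest to the query pose.

    The cost is Euclidean distance plus a weighted heading difference.
    Both the cost and angle_weight are illustrative assumptions.
    """
    def cost(entry):
        return (math.dist(entry.position, position)
                + angle_weight * angle_diff_deg(entry.yaw_deg, yaw_deg))

    return sorted(memory, key=cost)[:k]


memory = [
    MemoryEntry(0, (0.0, 0.0, 0.0), 0.0),
    MemoryEntry(1, (5.0, 0.0, 0.0), 90.0),
    MemoryEntry(2, (0.5, 0.0, 0.0), 10.0),
    MemoryEntry(3, (0.2, 0.0, 0.0), 180.0),
]

# Query: camera at the origin, heading 350 degrees.
nearest = retrieve(memory, position=(0.0, 0.0, 0.0), yaw_deg=350.0)
print([e.frame_id for e in nearest])  # prints [0, 2]
```

In this toy example, frame 3 is spatially closer to the query than frame 2 but faces the opposite direction, so it ranks lower. That is the intuition behind making retrieval depend on the camera view rather than on position or recency alone. A real system would presumably retrieve latent features rather than frame indices and use full 6-DoF poses.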