Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Abstract

With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to achieve memory-enabled long-term temporal consistency and high-resolution real-time generation at the same time, which limits their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for real-time, long-form video generation at 720p. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long-horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves real-time generation at up to 40 FPS at 720p resolution with a 5B-parameter model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.
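
The abstract names camera-aware memory retrieval but does not say how past frames are selected. As a rough illustration only, the sketch below picks the stored frames whose camera pose is closest to the current one. The `MemoryEntry` layout, the `retrieve_memory` function, the pose distance (Euclidean position distance plus a weighted yaw difference), and the `yaw_weight` value are all hypothetical choices made for this example, not the method used in Matrix-Game 3.0.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """Hypothetical record for one remembered frame (not from the paper)."""
    frame_id: int
    position: tuple[float, float, float]  # camera position in world coordinates
    yaw_deg: float                        # camera heading in degrees


def yaw_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute angle between two headings, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def retrieve_memory(bank, position, yaw_deg, k=2, yaw_weight=0.05):
    """Return the k entries whose camera pose is closest to the query pose.

    Assumption: pose distance = Euclidean position distance
    + yaw_weight * heading difference. This scoring rule is illustrative.
    """
    def pose_distance(entry: MemoryEntry) -> float:
        return (math.dist(entry.position, position)
                + yaw_weight * yaw_difference(entry.yaw_deg, yaw_deg))

    return sorted(bank, key=pose_distance)[:k]


# Toy memory bank: frame ids and poses are made up for the example.
bank = [
    MemoryEntry(0, (0.0, 0.0, 0.0), 0.0),
    MemoryEntry(40, (5.0, 0.0, 0.0), 90.0),
    MemoryEntry(80, (0.5, 0.0, 0.0), 350.0),
    MemoryEntry(120, (0.2, 0.0, 0.0), 180.0),
]

# Query with the camera back at the origin, facing its original heading.
selected = retrieve_memory(bank, position=(0.0, 0.0, 0.0), yaw_deg=0.0, k=2)
print([entry.frame_id for entry in selected])  # [0, 80]
```

In this toy setup, frame 120 is spatially closer to the query than frame 80 but faces the opposite direction, so the heading term pushes it out of the top two. A real system would presumably inject the retrieved frames (or their latents) as conditioning for the generator, a step this sketch does not attempt.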
