Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Abstract

With the advancement of interactive video generation, diffusion models have increasingly demonstrated their potential as world models. However, existing approaches still struggle to simultaneously achieve memory-enabled long-term temporal consistency and high-resolution real-time generation, limiting their applicability in real-world scenarios. To address this, we present Matrix-Game 3.0, a memory-augmented interactive world model designed for 720p real-time long-form video generation. Building upon Matrix-Game 2.0, we introduce systematic improvements across data, model, and inference. First, we develop an upgraded industrial-scale infinite data engine that integrates Unreal Engine-based synthetic data, large-scale automated collection from AAA games, and real-world video augmentation to produce high-quality Video-Pose-Action-Prompt quadruplet data at scale. Second, we propose a training framework for long-horizon consistency: by modeling prediction residuals and re-injecting imperfect generated frames during training, the base model learns self-correction; meanwhile, camera-aware memory retrieval and injection enable the base model to achieve long-horizon spatiotemporal consistency. Third, we design a multi-segment autoregressive distillation strategy based on Distribution Matching Distillation (DMD), combined with model quantization and VAE decoder pruning, to achieve efficient real-time inference. Experimental results show that Matrix-Game 3.0 achieves up to 40 FPS real-time generation at 720p resolution with a 5B model, while maintaining stable memory consistency over minute-long sequences. Scaling up to a 2x14B model further improves generation quality, dynamics, and generalization. Our approach provides a practical pathway toward industrial-scale deployable world models.
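The abstract names camera-aware memory retrieval but gives no details of how it works. As a rough illustration of the general idea only, and not the method used in Matrix-Game 3.0, the sketch below picks the stored frames whose camera pose is closest to the current one. The `MemoryEntry` structure, the `retrieve` function, the cost formula, and the `angle_weight` value are all hypothetical and chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """A past frame and the camera pose it was generated from (hypothetical layout)."""
    frame_id: int
    position: tuple  # (x, y, z) camera position in arbitrary world units
    yaw_deg: float   # camera heading in degrees


def yaw_difference(a_deg, b_deg):
    """Smallest absolute angle between two headings, in degrees (0 to 180)."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def retrieve(memory, position, yaw_deg, k=2, angle_weight=0.05):
    """Return the k stored entries with the lowest pose cost.

    The cost is an assumed heuristic: Euclidean distance between camera
    positions plus a weighted heading difference.
    """
    def cost(entry):
        distance = math.dist(entry.position, position)
        return distance + angle_weight * yaw_difference(entry.yaw_deg, yaw_deg)

    return sorted(memory, key=cost)[:k]


# Toy memory bank: frames 0 and 2 look at roughly the same view as the query.
memory = [
    MemoryEntry(0, (0.0, 0.0, 0.0), 0.0),
    MemoryEntry(1, (5.0, 0.0, 0.0), 90.0),
    MemoryEntry(2, (0.5, 0.0, 0.0), 350.0),
    MemoryEntry(3, (0.2, 0.0, 0.0), 180.0),
]
selected = retrieve(memory, position=(0.0, 0.0, 0.0), yaw_deg=5.0, k=2)
selected_ids = [entry.frame_id for entry in selected]
print(selected_ids)  # [0, 2]
```

In this toy setup, frame 3 is spatially close to the query but faces the opposite direction, so it is ranked below frame 2. A retrieval rule that considered position alone would have selected it instead.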
