Solaris: Building a Multiplayer Video World Model in Minecraft
February 25, 2026
Authors: Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, Saining Xie
cs.AI
Abstract
Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives and fail to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection in video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized capture of video and actions. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework covering multiplayer movement, memory, grounding, building, and view consistency. We train Solaris with a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show that our architecture and training design outperform existing baselines. By open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.