Solaris: Building a Multiplayer Video World Model in Minecraft
February 25, 2026
Authors: Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, Saining Xie
cs.AI
Abstract
Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives and fail to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection in video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized capture of video and actions. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show that our architecture and training design outperform existing baselines. By open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.