MultiWorld: Scalable Multi-Agent Multi-View Video World Models
April 20, 2026
作者: Haoyu Wu, Jiwen Yu, Yingtian Zou, Xihui Liu
cs.AI
Abstract
Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/
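The abstract's ingredients (action-conditioned next-frame prediction, multi-agent conditioning, a global state shared across views, and parallel view synthesis) can be illustrated with a minimal toy sketch. Everything below is an assumption for illustration: the class, shapes, and linear dynamics are hypothetical stand-ins, not MultiWorld's actual architecture.

```python
# Toy sketch of an action-conditioned multi-agent multi-view world model.
# All names, shapes, and the linear "dynamics" are illustrative assumptions.
import numpy as np

class ToyMultiViewWorldModel:
    def __init__(self, n_agents, n_views, act_dim=4, feat_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.n_views = n_views
        # per-agent action embedding (stand-in for the Multi-Agent Condition Module)
        self.W_act = rng.standard_normal((act_dim, feat_dim)) * 0.1
        # frame-transition weights, shared across views
        self.W_frame = rng.standard_normal((feat_dim, feat_dim)) * 0.1
        # pooled cross-view projection (stand-in for the Global State Encoder)
        self.W_global = rng.standard_normal((feat_dim, feat_dim)) * 0.1

    def step(self, frames, actions):
        """frames: (n_views, feat_dim) latent features of the current frame per view;
        actions: (n_agents, act_dim) one action vector per agent."""
        # condition on all agents' actions at once: embed each, then sum
        cond = (actions @ self.W_act).sum(axis=0)            # (feat_dim,)
        # global state: pool across views so every view observes the same scene state
        global_state = frames.mean(axis=0) @ self.W_global   # (feat_dim,)
        # predict all views in parallel (one batched matmul, no per-view loop)
        return frames @ self.W_frame + cond + global_state   # (n_views, feat_dim)

model = ToyMultiViewWorldModel(n_agents=2, n_views=3)
frames = np.ones((3, 8))
actions = np.zeros((2, 4))
next_frames = model.step(frames, actions)
```

The design choice the sketch mirrors is that views are synthesized by one batched operation rather than a sequential per-view loop, while the pooled global state ties their observations together; any agent's action perturbs every view through the shared condition vector.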