MultiWorld: Scalable Multi-Agent Multi-View Video World Models
April 20, 2026
Authors: Haoyu Wu, Jiwen Yu, Yingtian Zou, Xihui Liu
cs.AI
Abstract
Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. These models are typically formulated as action-conditioned video generators that take historical frames and current actions as input and predict future frames. Yet most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/
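The abstract describes a data flow that is easy to state concretely: per-agent actions are encoded into a conditioning signal (the Multi-Agent Condition Module), all views' histories are pooled into a shared latent (the Global State Encoder), and future frames for every view are then decoded in parallel. The following is a minimal NumPy sketch of that interface only; every name, shape, and the toy "dynamics" are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# Hypothetical sketch of the interface the abstract describes: a world model
# mapping (historical frames, per-agent actions) -> future frames per view.
# Shapes and modules below are assumptions for illustration.

rng = np.random.default_rng(0)

N_AGENTS, N_VIEWS = 2, 3     # agent and view counts are flexible
T_HIST, T_FUT = 4, 2         # history length, prediction horizon
H, W, C = 8, 8, 3            # toy frame size
A_DIM, D = 6, 16             # per-agent action dim, latent dim

def multi_agent_condition(actions):
    """Toy stand-in for the Multi-Agent Condition Module: embed each
    agent's action, then pool into one conditioning vector."""
    w = rng.standard_normal((A_DIM, D)) * 0.1
    return (actions @ w).mean(axis=0)                      # (D,)

def global_state_encoder(history):
    """Toy stand-in for the Global State Encoder: pool all views'
    histories into one shared latent so views stay consistent."""
    pooled = history.mean(axis=(0, 1)).reshape(-1)         # (H*W*C,)
    w = rng.standard_normal((pooled.size, D)) * 0.01
    return pooled @ w                                      # (D,)

def predict_future(history, actions):
    """history: (N_VIEWS, T_HIST, H, W, C); actions: (N_AGENTS, A_DIM)."""
    z = multi_agent_condition(actions) + global_state_encoder(history)
    last = history[:, -1]                                  # (N_VIEWS, H, W, C)
    # one shared latent drives every view; views are decoded in parallel
    # here via vectorization over the leading view axis
    step = z[:C].reshape(1, 1, 1, C) * 0.01
    return np.stack([last + (t + 1) * step for t in range(T_FUT)], axis=1)

history = rng.standard_normal((N_VIEWS, T_HIST, H, W, C))
actions = rng.standard_normal((N_AGENTS, A_DIM))
future = predict_future(history, actions)                  # (N_VIEWS, T_FUT, H, W, C)
```

The key structural point the sketch captures is that the conditioning latent `z` is shared across all views before decoding, which is one plausible way the framework could keep multi-view observations consistent while still scaling the number of agents and views independently.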