MultiWorld: Scalable Multi-Agent Multi-View Video World Models

Abstract

Video world models have achieved remarkable success in simulating how an environment evolves in response to the actions of users or agents. They are formulated as action-conditioned video generation models that take historical frames and current actions as input and predict future frames. However, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent, multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of the number of agents and views, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/
