MultiWorld: Scalable Multi-Agent Multi-View Video World Models

Abstract



Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/
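The abstract frames a world model as an action-conditioned video generator: history frames and the current actions of every agent go in, and the next frame for every view comes out. The toy sketch below only illustrates that interface and the autoregressive rollout it implies. Everything in it is an assumption, not MultiWorld's method: frames are hypothetical flat feature vectors, `encode_actions` is a hypothetical stand-in for the Multi-Agent Condition Module (per-agent weighting plus pooling so the agent count can vary), `encode_global_state` is a hypothetical stand-in for the Global State Encoder (a plain mean over views), and `predict_view` is a fixed linear mix in place of a video generation model.

```python
from typing import List, Sequence

Frame = List[float]  # hypothetical toy "frame": a flat feature vector, not an image


def encode_actions(actions: Sequence[Sequence[float]], dim: int) -> Frame:
    """Hypothetical multi-agent conditioning: weight each agent's action by an
    index-dependent factor and sum, so any number of agents can be pooled."""
    cond = [0.0] * dim
    for agent_id, action in enumerate(actions):
        weight = 1.0 + 0.1 * agent_id  # crude agent-identity tag (assumption)
        for d in range(dim):
            cond[d] += weight * action[d]  # assumes len(action) == dim
    return cond


def encode_global_state(latest_frames: Sequence[Frame]) -> Frame:
    """Hypothetical global state: element-wise mean of every view's latest frame,
    shared by all views so their predictions stay mutually consistent."""
    n = len(latest_frames)
    return [sum(vals) / n for vals in zip(*latest_frames)]


def predict_view(history: Sequence[Frame], global_state: Frame, cond: Frame) -> Frame:
    """Toy per-view predictor standing in for a video generation model.
    This placeholder only looks at the most recent frame of the history."""
    last = history[-1]
    return [0.5 * x + 0.3 * g + 0.2 * c for x, g, c in zip(last, global_state, cond)]


def rollout(histories: Sequence[Sequence[Frame]],
            action_seq: Sequence[Sequence[Sequence[float]]]) -> List[List[Frame]]:
    """Autoregressive rollout: each step consumes all agents' actions and
    appends one predicted frame to every view's history."""
    views = [list(h) for h in histories]
    dim = len(views[0][-1])
    for actions in action_seq:
        cond = encode_actions(actions, dim)
        g = encode_global_state([v[-1] for v in views])
        # Given the shared inputs, each view is predicted independently,
        # so this comprehension could be dispatched in parallel.
        next_frames = [predict_view(v, g, cond) for v in views]
        for v, f in zip(views, next_frames):
            v.append(f)
    return views
```

For example, `rollout([[[0.0, 0.0]], [[0.0, 0.0]]], [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]] * 4)` runs four steps with two views and three agents and returns five frames per view. Both the number of agents per step and the number of views are read from the inputs, which loosely mirrors the flexible agent and view counts the abstract describes.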
