ActionParty: Multi-Subject Action Binding in Generative Video Games

Abstract

Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings and fail to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
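
The abstract does not specify how the spatial biasing mechanism is implemented. As a rough illustration only, the Python sketch below assumes one plausible reading: each subject state token attends over a grid of video latent cells, and an additive distance-based bias steers its attention toward the cells near that subject. The function names (`spatial_bias`, `biased_attention`), the Gaussian-shaped bias, the `sigma` parameter, and the toy grid are all hypothetical and are not taken from the paper.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_bias(subject_pos, grid_h, grid_w, sigma):
    # Assumed form of the bias: 0 at the subject's cell, increasingly
    # negative with squared distance (a Gaussian in log space).
    r0, c0 = subject_pos
    return [-((r - r0) ** 2 + (c - c0) ** 2) / (2.0 * sigma ** 2)
            for r in range(grid_h) for c in range(grid_w)]

def biased_attention(query, keys, bias):
    # Attention weights of one state token over all latent cells:
    # scaled dot-product logits plus the additive spatial bias.
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale + b
              for key, b in zip(keys, bias)]
    return softmax(logits)

grid_h, grid_w = 4, 4
# Identical keys everywhere, so content alone gives no preference
# and any locality in the weights comes from the bias.
keys = [[1.0, 0.0] for _ in range(grid_h * grid_w)]
query = [1.0, 0.0]
# Hypothetical subject positions (row, col) on the latent grid.
subjects = {"player_0": (0, 0), "player_1": (3, 3)}

weights = {
    name: biased_attention(query, keys,
                           spatial_bias(pos, grid_h, grid_w, sigma=1.0))
    for name, pos in subjects.items()
}
```

In this toy setup each state token's attention peaks at its own subject's cell, which is one way a per-subject update could be kept separate from the rest of the frame. The actual ActionParty mechanism may differ.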
