ActionParty: Multi-Subject Action Binding in Generative Video Games

Abstract

Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings and fail to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. To this end, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
