

ActionParty: Multi-Subject Action Binding in Generative Video Games

April 2, 2026
Authors: Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, Fabio Pizzati, Aliaksandr Siarohin
cs.AI

Abstract

Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
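The abstract does not spell out how the spatial biasing mechanism couples each subject's state token to the video latents. The following is a minimal NumPy sketch under one plausible reading: cross-attention from state tokens to spatial video tokens, with a distance-based additive bias so each subject reads mostly from latents near its tracked 2D position. All names (`spatially_biased_attention`, `sigma`, the Gaussian-shaped bias) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatially_biased_attention(state_tokens, video_latents, subject_pos, latent_pos, sigma=2.0):
    """Cross-attention from subject state tokens to video latents with an
    additive spatial bias (hypothetical sketch): each subject attends mostly
    to latents near its tracked position, tying its update to the right region.

    state_tokens:  (S, d)  one latent state vector per subject
    video_latents: (N, d)  flattened spatial video tokens
    subject_pos:   (S, 2)  tracked (x, y) position of each subject
    latent_pos:    (N, 2)  (x, y) grid coordinate of each video token
    """
    d = state_tokens.shape[-1]
    logits = state_tokens @ video_latents.T / np.sqrt(d)              # (S, N)
    # Squared distance from every subject to every latent grid cell.
    dist2 = ((subject_pos[:, None, :] - latent_pos[None, :, :]) ** 2).sum(-1)
    logits = logits - dist2 / (2.0 * sigma ** 2)                      # Gaussian-shaped bias
    weights = softmax(logits, axis=-1)                                # rows sum to 1
    updated_states = weights @ video_latents                          # (S, d)
    return updated_states, weights
```

A small `sigma` concentrates each subject's attention on its own neighborhood, which is one way the global frame rendering could stay decoupled from per-subject, action-conditioned updates.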
PDF (21) · April 4, 2026