

ActionParty: Multi-Subject Action Binding in Generative Video Games

April 2, 2026
作者: Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, Fabio Pizzati, Aliaksandr Siarohin
cs.AI

Abstract

Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings and fail to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. To this end, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
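To make the idea of spatially biased joint modeling concrete, below is a minimal numpy sketch of one possible form such a mechanism could take: each subject's state token attends over the video latents, and an additive bias boosts attention logits at the spatial locations that subject occupies. This is an illustrative formulation only; the function names, the binary subject masks, and the fixed `bias_strength` are assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_state_update(state_tokens, video_latents, subject_masks, bias_strength=4.0):
    """One hypothetical update step for subject state tokens.

    state_tokens:  (S, d) one latent per subject
    video_latents: (N, d) flattened spatial video tokens
    subject_masks: (S, N) 1 where a subject occupies a location, else 0
    """
    d = state_tokens.shape[-1]
    # Standard scaled dot-product attention logits: (S, N)
    logits = state_tokens @ video_latents.T / np.sqrt(d)
    # Additive spatial bias: steer each state token toward its own subject's region
    logits = logits + bias_strength * subject_masks
    attn = softmax(logits, axis=-1)
    # Updated state tokens aggregate the (biased) video evidence: (S, d)
    return attn @ video_latents

# Toy usage: 3 subjects, 16 spatial locations, 8-dim latents
rng = np.random.default_rng(0)
states = rng.normal(size=(3, 8))
latents = rng.normal(size=(16, 8))
masks = np.zeros((3, 16))
masks[0, :4] = 1; masks[1, 4:8] = 1; masks[2, 8:12] = 1
updated = biased_state_update(states, latents, masks)
```

Because the bias is additive on the logits, each subject's attention mass shifts toward its own spatial region while still allowing global context to leak in, which matches the abstract's stated goal of separating global frame rendering from per-subject updates.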