WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG
March 24, 2026
Authors: Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, Kaipeng Zhang
cs.AI
Abstract
Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn these action-conditioned dynamics from data. However, existing datasets rarely meet these requirements: they typically lack diverse and semantically meaningful action spaces, and their actions are tied directly to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and to maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world-modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models on two tasks: Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.
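To make the per-frame annotation structure concrete, here is a minimal sketch of what one synchronized record could look like. The field names and types are purely illustrative assumptions for exposition; the actual WildWorld schema is not specified in this abstract and may differ.

```python
from dataclasses import dataclass

@dataclass
class FrameAnnotation:
    """Hypothetical per-frame record: one action label plus the
    synchronized state annotations named in the abstract."""
    frame_index: int
    action_id: int                                 # one of 450+ semantic actions
    skeleton: list[tuple[float, float, float]]     # character joint positions (x, y, z)
    world_state: dict[str, float]                  # explicit game-state variables
    camera_pose: list[float]                       # e.g. translation + quaternion (7 values)
    depth_map_path: str                            # path to this frame's depth map

# Minimal example record (all values are made up for illustration)
record = FrameAnnotation(
    frame_index=0,
    action_id=17,
    skeleton=[(0.0, 1.6, 0.0)],
    world_state={"player_health": 100.0},
    camera_pose=[0.0, 1.8, -3.0, 0.0, 0.0, 0.0, 1.0],
    depth_map_path="frames/000000_depth.png",
)
print(record.action_id)  # → 17
```

Decoupling the action label from pixels via such explicit state fields is what lets a world model condition on semantics rather than raw visual change.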