

WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

March 24, 2026
Authors: Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, Kaipeng Zhang
cs.AI

Abstract

Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing only partial information about the state. Recent video world models attempt to learn these action-conditioned dynamics from data. However, existing datasets rarely meet these requirements: they typically lack diverse and semantically meaningful action spaces, and actions are tied directly to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.
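The latent-state view the abstract opens with can be made concrete with a minimal sketch: a hidden state evolves as s_{t+1} = f(s_t, a_t) under actions, while the model only ever sees partial observations o_t = g(s_t). The linear maps, dimensions, and random actions below are illustrative stand-ins, not the paper's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): latent state, action, observation.
STATE_DIM, ACTION_DIM, OBS_DIM = 8, 3, 4

# Random linear maps as stand-ins for the transition f and observation g.
F_s = rng.normal(scale=0.1, size=(STATE_DIM, STATE_DIM))
F_a = rng.normal(scale=0.1, size=(STATE_DIM, ACTION_DIM))
G = rng.normal(size=(OBS_DIM, STATE_DIM))  # OBS_DIM < STATE_DIM: partial view

def step(state, action):
    """One transition of the latent dynamics: s_{t+1} = f(s_t, a_t)."""
    return F_s @ state + F_a @ action

def observe(state):
    """Partial observation o_t = g(s_t); the state itself stays hidden."""
    return G @ state

state = np.zeros(STATE_DIM)
trajectory = []
for t in range(5):
    action = rng.normal(size=ACTION_DIM)  # e.g. a movement or attack command
    state = step(state, action)           # world evolves in latent space
    trajectory.append(observe(state))     # the model only sees observations

print(len(trajectory), trajectory[0].shape)  # 5 (4,)
```

The point of the sketch is the mediation the dataset targets: actions change the latent state, and pixels (here, `observe`) are a lossy function of that state rather than of the action itself.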