ActWorld: From Explorable to Interactive World Models via Action-Aware Memory

Abstract

Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (e.g., walk, turn, look around), while interaction with objects in the scene (e.g., picking up plates, opening doors, or triggering physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video settings. The resulting worlds are visually explorable but not truly actionable. In this work, we present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the navigation-interaction gap stems from two bottlenecks. The first is a data bottleneck: human-object interaction data with accurate, dense labels is lacking. The second is a memory bottleneck: recency-biased history compression in existing world models discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology. On the data side, we construct a dataset of 100K interaction videos, each annotated with per-chunk captions generated via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control. The project page is available at https://interactwm.github.io/ActWorld.
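To make the action-forgetting argument concrete, the following toy sketch contrasts a recency-biased history selection with one that gives priority to high-importance interaction chunks. It is only an illustration under stated assumptions, not the ActWorld implementation: the abstract does not specify how interaction importance is computed or how routing is performed, so `Chunk`, its `importance` score, the `threshold` parameter, `recency_compress`, `action_aware_compress`, and `MemoryBank` are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    index: int          # position of the chunk in the rollout
    caption: str        # per-chunk caption (made-up example text)
    importance: float   # hypothetical interaction-importance score in [0, 1]


def recency_compress(history, budget):
    """Recency-biased baseline: keep only the most recent `budget` chunks."""
    return history[-budget:] if budget > 0 else []


def action_aware_compress(history, budget, threshold=0.5):
    """Keep high-importance chunks first, then fill leftover slots with recent ones."""
    events = [c for c in history if c.importance >= threshold]
    # If events exceed the budget, prefer the most important ones.
    events = sorted(events, key=lambda c: c.importance, reverse=True)[:budget]
    kept = {c.index for c in events}
    for c in reversed(history):  # fill remaining slots newest-first
        if len(kept) >= budget:
            break
        kept.add(c.index)
    return [c for c in history if c.index in kept]  # restore temporal order


@dataclass
class MemoryBank:
    """Toy persistent store: latest known state per object identity."""
    states: dict = field(default_factory=dict)

    def update(self, object_id, state):
        self.states[object_id] = state


history = [
    Chunk(0, "walk toward the table", 0.1),
    Chunk(1, "pick up the plate", 0.9),  # event-transition chunk
    Chunk(2, "turn left", 0.1),
    Chunk(3, "walk to the door", 0.1),
    Chunk(4, "look around", 0.1),
]

recent = recency_compress(history, budget=2)
aware = action_aware_compress(history, budget=2)

bank = MemoryBank()
bank.update("plate", "held")  # object state survives regardless of the budget

print([c.index for c in recent])  # [3, 4]: the pick-up event is forgotten
print([c.index for c in aware])   # [1, 4]: the pick-up event is retained
```

With a budget of two chunks, the recency-biased selection drops the chunk in which the plate was picked up, so nothing in the retained history explains why the plate is no longer on the table. The importance-aware selection keeps that chunk, and the toy memory bank records the object state independently of the history budget.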