ActWorld: From Explorable to Interactive World Models via Action-Aware Memory

Abstract

Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (e.g., walking, turning, looking around), while interaction with objects in the scene (e.g., picking up plates, opening doors, or triggering physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video settings. The resulting worlds are visually explorable but not truly actionable. In this work, we present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the gap between navigation and interaction stems from two bottlenecks. The first is a data bottleneck: human-object interaction data with accurate, dense labels is lacking. The second is a memory bottleneck: recency-biased history compression in existing world models discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology. On the data side, we construct a dataset of 100K interaction videos, each annotated with per-chunk captions generated via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control. The project page is available at https://interactwm.github.io/ActWorld.
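To make the memory bottleneck concrete, the following toy sketch contrasts recency-biased history selection with selection routed by interaction importance. It is an illustration under stated assumptions, not the ActWorld implementation: the `Chunk` type, the scalar `importance` score, the `threshold`, and the `budget` are all hypothetical, and the persistent memory bank is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int         # position of the chunk in the rollout
    importance: float  # hypothetical interaction-importance score in [0, 1]


def recency_compress(history, budget):
    """Recency-biased baseline: keep only the most recent `budget` chunks."""
    return history[-budget:] if budget > 0 else []


def action_aware_compress(history, budget, threshold=0.5):
    """Keep event chunks (importance >= threshold) first, then fill the
    remaining budget with the most recent non-event chunks."""
    if budget <= 0:
        return []
    events = [c for c in history if c.importance >= threshold]
    events = events[-budget:]  # if events alone exceed the budget, keep the latest
    kept = {c.index for c in events}
    remaining = budget - len(events)
    recent = [c for c in reversed(history) if c.index not in kept][:remaining]
    # Restore temporal order before returning.
    return sorted(events + recent, key=lambda c: c.index)


# Chunk 1 stands in for an interaction event (e.g., a door being opened);
# every other chunk is plain navigation with a low score.
history = [Chunk(i, 0.9 if i == 1 else 0.1) for i in range(8)]

print([c.index for c in recency_compress(history, 3)])
print([c.index for c in action_aware_compress(history, 3)])
```

With a budget of three chunks, the recency-biased selection drops chunk 1 even though it determines the later state of the door, whereas the importance-routed selection retains it alongside the two most recent chunks. This mirrors, in a highly simplified form, the action-forgetting pathology described in the abstract.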