AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Abstract

World-action models have emerged as a promising paradigm for robot manipulation: by jointly modeling visual scene dynamics and actions, they inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are often redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. To this end, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains a rolling key-value memory over past observations and exposes a reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to the real-time execution state, without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across four real-world tasks, while reaching 24.17 Hz closed-loop control, a 4.59x speedup over Fast-WAM.
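
The asynchronous execution pattern described above, a slow planner whose output is cached and reused by a fast controller, can be illustrated with a toy two-rate loop. This is only a minimal sketch of the scheduling idea, not the AHA-WAM implementation: the class `AsyncTwoRateController`, the callables `plan_fn` and `act_fn`, and the fixed `replan_every` ratio are all hypothetical names and assumptions introduced here, and scalar floats stand in for the layerwise latent context and the action chunks.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AsyncTwoRateController:
    """Toy two-rate loop: a slow planner refreshes a cached context,
    and a fast actor reuses that context on every control step.

    Hypothetical illustration only; floats stand in for latent context.
    """

    plan_fn: Callable[[float], float]        # slow: observation -> context
    act_fn: Callable[[float, float], float]  # fast: (context, observation) -> action
    replan_every: int = 4                    # assumed fixed ratio of fast to slow steps
    planner_calls: int = 0                   # counts how often the slow branch ran
    _context: Optional[float] = field(default=None, init=False, repr=False)
    _steps_since_plan: int = field(default=0, init=False, repr=False)

    def step(self, obs: float) -> float:
        # Refresh the cached context only when it is missing or stale.
        if self._context is None or self._steps_since_plan >= self.replan_every:
            self._context = self.plan_fn(obs)
            self._steps_since_plan = 0
            self.planner_calls += 1
        self._steps_since_plan += 1
        # The fast branch always sees the current observation, so it can
        # react to the live state without rerunning the planner.
        return self.act_fn(self._context, obs)


if __name__ == "__main__":
    controller = AsyncTwoRateController(
        plan_fn=lambda obs: obs + 10.0,       # stand-in "long-horizon" target
        act_fn=lambda ctx, obs: ctx - obs,    # stand-in action: remaining distance
        replan_every=4,
    )
    actions = [controller.step(float(t)) for t in range(8)]
    print(controller.planner_calls, actions)
```

In this toy setting, eight control steps trigger the planner only twice, while every action still depends on the latest observation. The real system additionally involves learned routing (OVCR) and horizon-adaptive offset training, which this sketch does not attempt to model.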