AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing
June 8, 2026
Authors: Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu, Jiayue Kang, Zhixuan Liang, Wenjie Xu, Yinan Mao, Weinan Zhang, Xiaokang Yang, Ru Ying, Ran Zheng, Yao Mu
cs.AI
Abstract
World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.
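The asynchronous scheduling described in the abstract can be illustrated with a minimal sketch: a slow planner refreshes a cached context only occasionally, while a fast action module reuses that cache at every control step. This is not the authors' implementation. The class names, the `replan_every` parameter, and the scalar toy dynamics are all assumptions made for illustration, and the sketch models neither the diffusion process, nor layerwise joint attention, nor OVCR.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class WorldPlanner:
    """Hypothetical stand-in for the low-frequency video DiT."""
    memory_size: int = 4
    memory: List[float] = field(default_factory=list)
    calls: int = 0

    def plan(self, observation: float) -> List[float]:
        # Rolling window over past observations (toy analogue of KV memory).
        self.memory = (self.memory + [observation])[-self.memory_size:]
        self.calls += 1
        mean = sum(self.memory) / len(self.memory)
        # One toy "context" value per layer, reused until the next replan.
        return [mean, observation]


@dataclass
class ActionExpert:
    """Hypothetical stand-in for the high-frequency action DiT."""
    chunk_len: int = 2
    calls: int = 0

    def act(self, observation: float, context: List[float]) -> List[float]:
        # Uses the cached context plus the current observation,
        # without rerunning the planner.
        self.calls += 1
        step = (context[0] - observation) / self.chunk_len
        return [step] * self.chunk_len


def run_episode(num_action_steps: int, replan_every: int) -> Tuple[int, int]:
    """Run a toy closed loop; return (planner calls, action-expert calls)."""
    planner, expert = WorldPlanner(), ActionExpert()
    observation = 0.0
    context: Optional[List[float]] = None
    for step in range(num_action_steps):
        if step % replan_every == 0:
            # Low-frequency branch: refresh the cached world context.
            context = planner.plan(observation)
        # High-frequency branch: emit a short action chunk every step.
        chunk = expert.act(observation, context)
        # Toy environment: apply the chunk plus a constant drift.
        observation += sum(chunk) + 1.0
    return planner.calls, expert.calls


if __name__ == "__main__":
    print(run_episode(num_action_steps=12, replan_every=4))
```

With 12 control steps and a replan every 4 steps, the planner runs 3 times while the action module runs 12 times. The slow branch is skipped on most steps, which is the kind of saving an asynchronous design aims for.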