When to Trust Imagination: Adaptive Action Execution for World Action Models
May 7, 2026
Authors: Rui Wang, Yue Zhang, Jiehong Lin, Kuncheng Luo, Jianan Wang, Zhongrui Wang, Xiaojuan Qi
cs.AI
Abstract
World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and future actions. However, current WAMs typically execute a fixed number of predicted actions after each model inference, leaving the robot blind to whether the imagined future remains consistent with the actual physical rollout. In this work, we formulate adaptive WAM execution as a future-reality verification problem: the robot should execute longer when the WAM-predicted future remains reliable, and replan earlier when reality deviates from imagination. To this end, we propose Future Forward Dynamics Causal Attention (FFDC), a lightweight verifier that jointly reasons over predicted future actions, predicted visual dynamics, real observations, and language instructions to estimate whether the remaining action rollout can still be trusted. FFDC enables adaptive action chunk sizes as an emergent consequence of prediction-observation consistency, preserving the efficiency of long-horizon execution while restoring responsiveness in contact-rich or difficult phases. We further introduce Mixture-of-Horizon Training to improve long-horizon trajectory coverage for adaptive execution. Experiments on the RoboTwin benchmark and in the real world demonstrate that our method achieves a strong robustness-efficiency trade-off: on RoboTwin, it reduces WAM forward passes by 69.10% and execution time by 34.02%, while improving success rate by 2.54% over the short-chunk baseline; in real-world experiments, it improves success rate by 35%.
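The execute-longer-or-replan-earlier idea in the abstract can be illustrated with a minimal sketch. This is not the authors' FFDC verifier or their WAM: the `plan`, `step`, and `trust` callables, the 1-D toy world, and the 0.25 tolerance are all hypothetical stand-ins chosen for illustration. The verifier here is a plain distance threshold between the predicted and the observed state, whereas the paper describes a learned verifier that also uses language instructions and predicted actions.

```python
from typing import Callable, List, Set, Tuple


def adaptive_execute(
    plan: Callable[[float], Tuple[List[float], List[float]]],
    step: Callable[[float, float], float],
    trust: Callable[[float, float], bool],
    obs: float,
    total_steps: int,
) -> Tuple[float, int]:
    """Execute total_steps actions, replanning when the verifier rejects the rollout.

    Returns the final observation and the number of planner (model) calls.
    """
    planner_calls = 0
    executed = 0
    while executed < total_steps:
        # One "model inference": a chunk of actions plus the imagined observations.
        actions, predicted_obs = plan(obs)
        planner_calls += 1
        for action, predicted in zip(actions, predicted_obs):
            obs = step(obs, action)
            executed += 1
            if executed >= total_steps:
                break
            if not trust(predicted, obs):
                break  # reality drifted from the imagined future: replan early
    return obs, planner_calls


def make_toy(horizon: int, disturbed_steps: Set[int]):
    """Hypothetical 1-D world: each action moves +1; some steps get a 0.5 push."""
    clock = {"t": 0}

    def plan(obs: float) -> Tuple[List[float], List[float]]:
        actions = [1.0] * horizon
        predicted = [obs + (i + 1) for i in range(horizon)]
        return actions, predicted

    def step(obs: float, action: float) -> float:
        clock["t"] += 1
        bump = 0.5 if clock["t"] in disturbed_steps else 0.0
        return obs + action + bump

    def trust(predicted: float, observed: float) -> bool:
        # Stand-in verifier (assumption): a simple threshold, not a learned model.
        return abs(predicted - observed) < 0.25

    return plan, step, trust


# No disturbance: both full 8-step chunks are executed, so 2 planner calls.
clean_final, clean_calls = adaptive_execute(*make_toy(8, set()), 0.0, 16)

# A push at step 3 breaks the first chunk early, so one extra planner call.
bumped_final, bumped_calls = adaptive_execute(*make_toy(8, {3}), 0.0, 16)
```

In this toy setting the effective chunk size is not set in advance. It falls out of how long the prediction and the observation stay in agreement, which mirrors the emergent adaptive chunk size described in the abstract.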