When to Trust Imagination: Adaptive Action Execution for World Action Models
May 7, 2026
Authors: Rui Wang, Yue Zhang, Jiehong Lin, Kuncheng Luo, Jianan Wang, Zhongrui Wang, Xiaojuan Qi
cs.AI
Abstract
World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and future actions. However, current WAMs typically execute a fixed number of predicted actions after each model inference, leaving the robot blind to whether the imagined future remains consistent with the actual physical rollout. In this work, we formulate adaptive WAM execution as a future-reality verification problem: the robot should execute longer when the WAM-predicted future remains reliable, and replan earlier when reality deviates from imagination. To this end, we propose Future Forward Dynamics Causal Attention (FFDC), a lightweight verifier that jointly reasons over predicted future actions, predicted visual dynamics, real observations, and language instructions to estimate whether the remaining action rollout can still be trusted. FFDC enables adaptive action chunk sizes as an emergent consequence of prediction-observation consistency, preserving the efficiency of long-horizon execution while restoring responsiveness in contact-rich or difficult phases. We further introduce Mixture-of-Horizon Training to improve long-horizon trajectory coverage for adaptive execution. Experiments on the RoboTwin benchmark and in the real world demonstrate that our method achieves a strong robustness-efficiency trade-off: on RoboTwin, it reduces WAM forward passes by 69.10% and execution time by 34.02%, while improving success rate by 2.54% over the short-chunk baseline; in real-world experiments, it improves success rate by 35%.
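The execution rule described above (keep executing while the imagined future matches reality, replan as soon as it does not) can be illustrated with a minimal toy sketch. This is not the authors' FFDC implementation: the 1-D state, the `toy_planner` standing in for a WAM, the threshold-based `trust_remaining` standing in for a learned verifier, and all parameter values are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Plan:
    actions: List[float]        # predicted action chunk
    predicted_obs: List[float]  # imagined observation after each action


def toy_planner(obs: float, goal: float, horizon: int) -> Plan:
    """Hypothetical stand-in for a WAM: plan equal steps toward the goal
    and imagine the observation each step should produce."""
    step = (goal - obs) / horizon
    actions = [step] * horizon
    predicted = [obs + step * (i + 1) for i in range(horizon)]
    return Plan(actions, predicted)


def trust_remaining(predicted: float, observed: float, tol: float) -> bool:
    """Hypothetical stand-in for a learned verifier: trust the rest of the
    rollout only while imagined and real observations agree within tol."""
    return abs(predicted - observed) <= tol


def run_adaptive(
    env_step: Callable[[float, float, int], float],
    obs: float,
    goal: float,
    horizon: int = 8,
    tol: float = 0.05,
    max_steps: int = 100,
) -> Tuple[float, int, List[int]]:
    """Execute each chunk until the verifier rejects it, then replan.
    Returns (final observation, planner calls, executed chunk sizes)."""
    planner_calls = 0
    chunk_sizes: List[int] = []
    steps = 0
    while abs(goal - obs) > 1e-3 and steps < max_steps:
        plan = toy_planner(obs, goal, horizon)
        planner_calls += 1
        executed = 0
        for action, predicted in zip(plan.actions, plan.predicted_obs):
            obs = env_step(obs, action, steps)  # act in the "real" world
            steps += 1
            executed += 1
            if not trust_remaining(predicted, obs, tol):
                break  # reality diverged from imagination: replan early
        chunk_sizes.append(executed)
    return obs, planner_calls, chunk_sizes


def ideal_env(obs: float, action: float, t: int) -> float:
    """Toy world that behaves exactly as imagined."""
    return obs + action


def bumped_env(obs: float, action: float, t: int) -> float:
    """Toy world with a single disturbance at global step 3."""
    return obs + action + (0.2 if t == 3 else 0.0)
```

In this toy setup, `run_adaptive(ideal_env, 0.0, 1.0)` executes the full 8-step chunk with a single planner call, while `run_adaptive(bumped_env, 0.0, 1.0)` cuts the first chunk short at the disturbance and replans once. The chunk size is therefore not set in advance but falls out of the consistency check, which is the behavior the abstract describes as emergent.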