When to Trust Imagination: Adaptive Action Execution for World Action Models

Abstract

World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and future actions. However, current WAMs typically execute a fixed number of predicted actions after each model inference, leaving the robot blind to whether the imagined future remains consistent with the actual physical rollout. In this work, we formulate adaptive WAM execution as a future-reality verification problem: the robot should execute longer when the WAM-predicted future remains reliable, and replan earlier when reality deviates from imagination. To this end, we propose Future Forward Dynamics Causal Attention (FFDC), a lightweight verifier that jointly reasons over predicted future actions, predicted visual dynamics, real observations, and language instructions to estimate whether the remaining action rollout can still be trusted. FFDC enables adaptive action chunk sizes as an emergent consequence of prediction-observation consistency, preserving the efficiency of long-horizon execution while restoring responsiveness in contact-rich or difficult phases. We further introduce Mixture-of-Horizon Training to improve long-horizon trajectory coverage for adaptive execution. Experiments on the RoboTwin benchmark and in the real world demonstrate that our method achieves a strong robustness-efficiency trade-off: on RoboTwin, it reduces WAM forward passes by 69.10% and execution time by 34.02%, while improving success rate by 2.54% over the short-chunk baseline; in real-world experiments, it improves success rate by 35%.
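
The execute-longer-or-replan-earlier rule described in the abstract can be illustrated with a minimal sketch. This is not the FFDC verifier itself: the function names (`toy_trust`, `adaptive_execute`), the scalar observations, the exponential score, and the `threshold` value are all assumptions chosen only to show how a per-step trust check could turn a fixed action chunk into an adaptive one.

```python
import math
from typing import Callable, Sequence


def toy_trust(imagined: float, real: float) -> float:
    """Hypothetical stand-in for a learned verifier.

    Returns 1.0 when the imagined and real observations match exactly
    and decays toward 0 as the gap between them grows.
    """
    return math.exp(-abs(imagined - real))


def adaptive_execute(
    actions: Sequence[float],
    predicted_obs: Sequence[float],
    step: Callable[[float], float],
    trust: Callable[[float, float], float] = toy_trust,
    threshold: float = 0.5,  # assumed cutoff, not a value from the paper
) -> int:
    """Execute predicted actions while the imagined future stays trusted.

    Returns how many actions were executed before the trust score fell
    below the threshold (or the chunk ran out), i.e. the effective chunk
    size. The caller would replan with a new model inference afterwards.
    """
    executed = 0
    for action, imagined in zip(actions, predicted_obs):
        real = step(action)  # apply the action, read the real observation
        executed += 1
        if trust(imagined, real) < threshold:
            break  # reality deviated from imagination: stop and replan
    return executed


# Toy rollout: reality follows the prediction for three steps, then diverges.
real_obs = iter([0.0, 0.1, 0.2, 2.0, 2.1])
executed = adaptive_execute(
    actions=[0.0] * 5,
    predicted_obs=[0.0, 0.1, 0.2, 0.3, 0.4],
    step=lambda action: next(real_obs),
)
print(executed)  # prints 4: the fourth step reveals the deviation
```

In this toy setting the chunk is cut short at the step where the observation stops matching the prediction, while a rollout that matches throughout runs the full five actions without an extra model call.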
