When to Trust Imagination: Adaptive Action Execution for World Action Models

Abstract

World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and future actions. However, current WAMs typically execute a fixed number of predicted actions after each model inference, leaving the robot blind to whether the imagined future remains consistent with the actual physical rollout. In this work, we formulate adaptive WAM execution as a future-reality verification problem: the robot should execute longer when the WAM-predicted future remains reliable, and replan earlier when reality deviates from imagination. To this end, we propose Future Forward Dynamics Causal Attention (FFDC), a lightweight verifier that jointly reasons over predicted future actions, predicted visual dynamics, real observations, and language instructions to estimate whether the remaining action rollout can still be trusted. FFDC enables adaptive action chunk sizes as an emergent consequence of prediction-observation consistency, preserving the efficiency of long-horizon execution while restoring responsiveness in contact-rich or difficult phases. We further introduce Mixture-of-Horizon Training to improve long-horizon trajectory coverage for adaptive execution. Experiments on the RoboTwin benchmark and in the real world demonstrate that our method achieves a strong robustness-efficiency trade-off: on RoboTwin, it reduces WAM forward passes by 69.10% and execution time by 34.02%, while improving success rate by 2.54% over the short-chunk baseline; in real-world experiments, it improves success rate by 35%.
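The execute-longer-or-replan-earlier rule described in the abstract can be illustrated with a toy control loop. The sketch below is not the paper's FFDC verifier, which is a learned attention model over actions, visual dynamics, real observations, and language. Here the "world" is a single scalar, the planner is a hand-written stand-in, and trust is a simple distance-based score. Every name (`ToyEnv`, `toy_wam`, `toy_trust`, `run_episode`) and every parameter value is a hypothetical chosen for illustration only.

```python
import math
from typing import List, Tuple


class ToyEnv:
    """1-D stand-in for the robot: the state moves by the action,
    plus one unmodeled push at a chosen timestep (hypothetical setup)."""

    def __init__(self, push_at: int, push: float) -> None:
        self.state = 0.0
        self.t = 0
        self.push_at = push_at
        self.push = push

    def step(self, action: float) -> float:
        self.state += action
        if self.t == self.push_at:
            self.state += self.push  # disturbance the planner cannot foresee
        self.t += 1
        return self.state


def toy_wam(state: float, goal: float, horizon: int) -> Tuple[List[float], List[float]]:
    """Stand-in for one WAM forward pass: returns a chunk of actions
    and the states the planner imagines they will produce."""
    actions: List[float] = []
    imagined: List[float] = []
    s = state
    for _ in range(horizon):
        a = max(-1.0, min(1.0, goal - s))  # step toward the goal, clipped to [-1, 1]
        s += a
        actions.append(a)
        imagined.append(s)
    return actions, imagined


def toy_trust(imagined: float, real: float) -> float:
    """Assumed trust score in (0, 1]: 1 when imagination matches reality."""
    return math.exp(-abs(imagined - real))


def run_episode(env: ToyEnv, goal: float, horizon: int, threshold: float,
                max_passes: int = 50) -> Tuple[int, int]:
    """Run until the goal is reached; return (forward passes, executed steps).

    After each executed action the imagined state is compared with the real
    one; the rest of the chunk is dropped once trust falls below `threshold`.
    A threshold of 0.0 disables verification (fixed-size chunks).
    """
    passes = steps = 0
    while abs(goal - env.state) > 1e-9 and passes < max_passes:
        actions, imagined = toy_wam(env.state, goal, horizon)
        passes += 1
        for action, imagined_state in zip(actions, imagined):
            real_state = env.step(action)
            steps += 1
            if toy_trust(imagined_state, real_state) < threshold:
                break  # reality diverged from imagination: replan early
    return passes, steps


if __name__ == "__main__":
    for label, horizon, threshold in [("adaptive", 8, 0.5),
                                      ("fixed long chunk", 8, 0.0),
                                      ("fixed short chunk", 2, 0.5)]:
        print(label, run_episode(ToyEnv(push_at=3, push=-2.0), 10.0, horizon, threshold))
```

In this toy setting the verified long-horizon run needs as few forward passes as the unverified long chunk and as few executed steps as the short chunk, which mirrors the trade-off the abstract describes. The numbers come from this hand-built example only and say nothing about the reported RoboTwin results.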

