AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling via Observation-Guided Context Routing

Abstract

World-action models have emerged as a promising paradigm for robot manipulation: by jointly modeling visual scene dynamics and actions, they inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, which forces the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. We therefore propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains a rolling key-value memory over past observations and exposes a reusable layerwise latent context encoding long-horizon scene evolution. A high-frequency action DiT queries this context through layerwise joint attention and executes short action chunks in closed loop. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to the real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control, a 4.59x speedup over Fast-WAM.
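
The asynchronous schedule described above can be illustrated with a small Python sketch. It is not the AHA-WAM implementation: it only simulates the timing pattern in which a slow planner refreshes a cached context occasionally while a fast action expert runs at every step against that cache. The names `run_async_loop`, `AsyncLoopLog`, and `replan_every` are hypothetical, the fixed replanning interval is an assumption, and reading the "offset" as the number of steps since the last planner pass is one plausible interpretation rather than the paper's stated definition.

```python
from dataclasses import dataclass, field


@dataclass
class AsyncLoopLog:
    """Counters for the toy schedule (hypothetical helper, not from the paper)."""
    planner_calls: int = 0
    action_calls: int = 0
    offsets: list = field(default_factory=list)


def run_async_loop(num_chunks: int, replan_every: int) -> AsyncLoopLog:
    """Simulate a slow planner and a fast action expert sharing a cached context.

    Assumption: the planner is refreshed on a fixed interval of `replan_every`
    action chunks. The real system may choose refresh times differently.
    """
    if replan_every < 1:
        raise ValueError("replan_every must be at least 1")

    log = AsyncLoopLog()
    context_step = None  # step at which the cached context was last refreshed

    for step in range(num_chunks):
        # Low-frequency branch: stand-in for one pass of the video DiT.
        if context_step is None or step - context_step >= replan_every:
            context_step = step
            log.planner_calls += 1

        # High-frequency branch: stand-in for one action-DiT call that reads
        # the cached context. The offset records how stale that context is.
        log.offsets.append(step - context_step)
        log.action_calls += 1

    return log


if __name__ == "__main__":
    log = run_async_loop(num_chunks=8, replan_every=4)
    print(log.planner_calls, log.action_calls, log.offsets)
```

In this toy setting the action expert sees contexts of varying staleness, which is the situation that training over a range of offsets would need to cover if the offset is interpreted as above.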