AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Abstract

World-action models have emerged as a promising paradigm for robot manipulation: they jointly model visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, which forces the world branch to model near-term frame variations that are redundant and only weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. We therefore propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains a rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to the real-time execution state without rerunning the video DiT. Experiments on RoboTwin and on real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across four real-world tasks, while reaching 24.17 Hz closed-loop control, a 4.59x speedup over Fast-WAM.
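To make the asynchronous two-rate execution concrete, the following is a minimal scheduling sketch, not the actual AHA-WAM implementation. It assumes a slow planner that is re-run only every `replan_every` control steps and a fast policy that runs at every step, reusing the cached context together with an integer offset that records how stale that context is. All names (`SlowWorldPlanner`, `FastActionPolicy`, `run_episode`, `replan_every`, `chunk_len`) and the toy arithmetic are hypothetical stand-ins; no DiT, attention, or OVCR logic is modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class WorldContext:
    """Cached planner output (hypothetical stand-in for layerwise latent context)."""
    created_at: int        # control step at which the planner last ran
    features: List[float]  # placeholder for the latent context


class SlowWorldPlanner:
    """Stand-in for the low-frequency world planner with a rolling observation memory."""

    def __init__(self, memory_size: int = 4) -> None:
        self.memory: List[float] = []
        self.memory_size = memory_size
        self.calls = 0

    def plan(self, observation: float, step: int) -> WorldContext:
        self.calls += 1
        # Keep only the most recent `memory_size` observations (rolling memory).
        self.memory = (self.memory + [observation])[-self.memory_size:]
        return WorldContext(created_at=step, features=list(self.memory))


class FastActionPolicy:
    """Stand-in for the high-frequency action expert; runs at every control step."""

    def __init__(self) -> None:
        self.calls = 0

    def act(self, observation: float, context: WorldContext,
            offset: int, chunk_len: int) -> List[float]:
        self.calls += 1
        # Toy computation: combine cached context, current observation, and staleness offset.
        base = sum(context.features) / len(context.features)
        return [base + observation + offset + i for i in range(chunk_len)]


def run_episode(num_steps: int = 12, replan_every: int = 4,
                chunk_len: int = 2) -> Tuple[int, int, List[Tuple[int, int, int]]]:
    planner, policy = SlowWorldPlanner(), FastActionPolicy()
    context: Optional[WorldContext] = None
    log: List[Tuple[int, int, int]] = []
    for step in range(num_steps):
        observation = float(step)  # placeholder for a camera observation
        # Re-run the expensive planner only when the cached context is too old.
        if context is None or step - context.created_at >= replan_every:
            context = planner.plan(observation, step)
        offset = step - context.created_at  # steps elapsed since the context was computed
        policy.act(observation, context, offset, chunk_len)
        log.append((step, context.created_at, offset))
    return planner.calls, policy.calls, log


if __name__ == "__main__":
    planner_calls, policy_calls, log = run_episode()
    print(planner_calls, policy_calls)
    print(log)
```

In this sketch the planner is invoked once per `replan_every` steps while the policy is invoked at every step, which is the source of the amortized cost saving; how the real model conditions on the offset and routes context is described only at a high level in the abstract and is not reproduced here.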