AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Abstract

World-action models have emerged as a promising paradigm for robot manipulation: by jointly modeling visual scene dynamics and actions, they inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, which forces the world branch to model redundant, weakly informative near-term frame variations. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. We therefore propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains a rolling key-value memory over past observations and exposes a reusable layerwise latent context encoding long-horizon scene evolution. A high-frequency action DiT, in turn, executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to the real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across four real-world tasks, while reaching 24.17 Hz closed-loop control, a 4.59x speedup over Fast-WAM.
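
The asynchronous schedule described above can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the AHA-WAM implementation: `WorldPlanner`, `ActionExpert`, `run_episode`, the `replan_every` interval, the memory size, and the scalar observations are all hypothetical stand-ins. The real video and action DiTs, layerwise joint attention, horizon-adaptive offset training, and OVCR are not modeled. The sketch shows only the scheduling idea: a slow planner refreshes a cached context occasionally, and a fast action expert reuses that context at every control step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorldPlanner:
    """Hypothetical stand-in for the low-frequency video DiT."""
    memory_size: int = 4                                # assumed rolling-memory length
    memory: List[float] = field(default_factory=list)   # past observations
    calls: int = 0

    def plan(self, observation: float) -> List[float]:
        # Rolling memory: keep only the most recent observations.
        self.memory = (self.memory + [observation])[-self.memory_size:]
        self.calls += 1
        # Toy "layerwise context": one value per layer (3 layers assumed).
        mean = sum(self.memory) / len(self.memory)
        return [mean * (layer + 1) for layer in range(3)]


@dataclass
class ActionExpert:
    """Hypothetical stand-in for the high-frequency action DiT."""
    chunk_len: int = 2   # assumed length of a short action chunk
    calls: int = 0

    def act(self, observation: float, context: List[float]) -> List[float]:
        self.calls += 1
        # Toy fusion of the live observation with the cached context.
        bias = sum(context) / len(context)
        return [observation + bias] * self.chunk_len


def run_episode(observations: List[float], replan_every: int = 4):
    """Run the planner every `replan_every` steps and the expert every step."""
    planner, expert = WorldPlanner(), ActionExpert()
    context: List[float] = []
    actions = []
    for step, obs in enumerate(observations):
        if step % replan_every == 0:
            context = planner.plan(obs)            # slow path: refresh context
        actions.append(expert.act(obs, context))   # fast path: reuse context
    return actions, planner.calls, expert.calls


if __name__ == "__main__":
    _, planner_calls, expert_calls = run_episode([0.0] * 8, replan_every=4)
    print(planner_calls, expert_calls)
```

In this toy setting the planner is invoked once per `replan_every` control steps while the action expert runs at every step, which is the kind of decoupling that lets the control rate exceed the planner's rate.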