Learning Transferable Dynamics Priors from Action-to-World Modeling

Abstract

We study action-conditioned world modeling as a scalable way to learn transferable dynamics priors for robot learning. By pretraining a model to predict how actions drive visual scene evolution, we obtain a world model that captures reusable interaction dynamics beyond appearance-level video generation. Concretely, we pretrain A2World, a multi-view interactive base diffusion world model, on large-scale robot manipulation data with real action annotations. We validate the learned dynamics priors from two complementary perspectives. First, we adapt A2World into a task- or scene-specialized real-world simulator, A2World-sim, whose long-horizon rollouts replace real-robot rollouts and thereby support simulator-based policy evaluation and scalable what-if analysis. Second, starting from the same pretrained weights, we adapt A2World into a joint video-action prediction model, A2World-policy, that predicts actions under visual and instruction conditioning. Experiments across simulation benchmarks and real-robot settings demonstrate that action-conditioned world model pretraining yields transferable dynamics priors that benefit both simulator-centric and policy-centric robot learning.