World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Abstract

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently on in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics such pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together, the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning. The scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate in every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project Website: https://world-pilot.github.io/
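
To make the two pathways concrete, the following toy sketch shows one way the data flow could be arranged. It is not the authors' implementation: every name (`WamPrior`, `latent_steering`, `action_steering`, `toy_policy`, `world_pilot_step`), the gated additive fusion, and the convex blend of trajectories are assumptions chosen only for illustration, since the abstract does not specify how either prior is injected.

```python
# Hypothetical sketch of the two steering pathways described in the abstract.
# All names and fusion rules here are illustrative assumptions, not the paper's method.
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class WamPrior:
    """Priors assumed to come from a World-Action Model (WAM)."""
    scene_latent: Vector        # scene-evolution latent (assumed fixed-size vector)
    trajectory: List[Vector]    # anticipated trajectory, one action vector per step


def latent_steering(visual_features: Vector, scene_latent: Vector,
                    gate: float = 0.5) -> Vector:
    """Condition perception features on the scene-evolution latent.

    Assumption: a gated additive fusion stands in for the real conditioning.
    """
    if len(visual_features) != len(scene_latent):
        raise ValueError("feature and latent dimensions must match")
    return [f + gate * z for f, z in zip(visual_features, scene_latent)]


def action_steering(policy_actions: List[Vector], prior_trajectory: List[Vector],
                    weight: float = 0.25) -> List[Vector]:
    """Use the anticipated trajectory as a motion prior.

    Assumption: a convex blend between the policy's draft and the prior.
    """
    return [
        [(1.0 - weight) * a + weight * p for a, p in zip(act, pri)]
        for act, pri in zip(policy_actions, prior_trajectory)
    ]


def toy_policy(features: Vector, horizon: int, action_dim: int) -> List[Vector]:
    """Stand-in action generator: emits the mean feature value in every coordinate."""
    mean = sum(features) / len(features)
    return [[mean] * action_dim for _ in range(horizon)]


def world_pilot_step(visual_features: Vector, prior: WamPrior) -> List[Vector]:
    """Run both pathways: steer perception first, then steer the action draft."""
    steered = latent_steering(visual_features, prior.scene_latent)
    horizon = len(prior.trajectory)
    action_dim = len(prior.trajectory[0])
    draft = toy_policy(steered, horizon, action_dim)
    return action_steering(draft, prior.trajectory)


if __name__ == "__main__":
    prior = WamPrior(scene_latent=[2.0, 4.0],
                     trajectory=[[0.5, 4.5], [2.5, 2.5]])
    print(world_pilot_step([0.0, 2.0], prior))
```

The only point of the sketch is the ordering: the scene-evolution latent enters before action generation, while the anticipated trajectory enters at the action generator. Passing a zero `scene_latent` or setting `weight=0.0` disables the corresponding pathway, which mirrors the kind of ablation the abstract alludes to when it notes that the scene-evolution prior is useful on its own.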