World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Abstract

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project Website: https://world-pilot.github.io/