World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Abstract

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently on in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics such pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together, the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning. The scene-evolution prior remains effective even when it is supplied by a video-pretrained world model that has not been post-trained on actions. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot out-of-distribution (OOD) benchmark and the highest success rate in every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project website: https://world-pilot.github.io/
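To make the two pathways concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that Latent Steering can be approximated by a FiLM-style scale-and-shift of visual tokens, and that Action Steering can be approximated by predicting a residual on top of the anticipated trajectory. The abstract specifies neither mechanism. All function names, tensor sizes, and weights below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_steering(visual_tokens, scene_latent, w_scale, w_shift):
    """Condition perception features on a scene-evolution latent.

    ASSUMPTION: FiLM-style modulation; the source does not state the mechanism.
    visual_tokens: (n_tokens, d_model), scene_latent: (d_latent,)
    """
    scale = scene_latent @ w_scale  # (d_model,)
    shift = scene_latent @ w_shift  # (d_model,)
    # Broadcast the per-channel scale and shift over all tokens.
    return visual_tokens * (1.0 + scale) + shift

def action_steering(anticipated_traj, residual):
    """Use the anticipated trajectory as a motion prior.

    ASSUMPTION: the action generator outputs a residual correction to the prior.
    """
    return anticipated_traj + residual

# Hypothetical sizes chosen only for illustration.
n_tokens, d_model, d_latent, horizon, action_dim = 16, 32, 8, 10, 7

visual_tokens = rng.normal(size=(n_tokens, d_model))
scene_latent = rng.normal(size=(d_latent,))            # stand-in for the WAM latent
anticipated_traj = rng.normal(size=(horizon, action_dim))  # stand-in for the WAM trajectory

w_scale = 0.01 * rng.normal(size=(d_latent, d_model))
w_shift = 0.01 * rng.normal(size=(d_latent, d_model))
w_head = 0.01 * rng.normal(size=(d_model, horizon * action_dim))

# Pathway 1: steer perception with the scene-evolution latent.
steered = latent_steering(visual_tokens, scene_latent, w_scale, w_shift)

# Stand-in action generator: a linear head over mean-pooled features.
residual = (steered.mean(axis=0) @ w_head).reshape(horizon, action_dim)

# Pathway 2: add the predicted residual to the anticipated trajectory.
actions = action_steering(anticipated_traj, residual)
```

In this sketch a zero scene latent leaves the visual tokens unchanged, so the policy falls back to its purely semantic conditioning when no scene-evolution prior is available.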