World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Abstract

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently on in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics such pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together, the two priors give the VLA an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning. The scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate in every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project website: https://world-pilot.github.io/
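
The two pathways named in the abstract can be pictured with a toy data-flow sketch. The abstract does not specify how either prior is injected, so everything below is an illustrative assumption and not the authors' implementation: the names `WorldActionPriors`, `latent_steering`, `action_steering`, and `policy_step` are hypothetical, Latent Steering is modeled as appending the scene-evolution latent as one extra perception token, and Action Steering is modeled as adding a predicted residual to the anticipated trajectory.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class WorldActionPriors:
    """Hypothetical container for the two priors a WAM would supply."""
    scene_latent: Vector       # anticipated scene evolution (Latent Steering input)
    trajectory: List[Vector]   # anticipated actions (Action Steering input)


def latent_steering(visual_tokens: List[Vector], scene_latent: Vector) -> List[Vector]:
    """Condition perception on the scene-evolution latent.

    Assumption: conditioning is done by appending the latent as one extra token.
    """
    return visual_tokens + [scene_latent]


def action_steering(prior: List[Vector], residual: List[Vector]) -> List[Vector]:
    """Use the anticipated trajectory as a motion prior.

    Assumption: the action generator predicts a per-step residual that is
    added element-wise to the prior trajectory.
    """
    if len(prior) != len(residual):
        raise ValueError("prior and residual must have the same number of steps")
    return [[p + r for p, r in zip(ps, rs)] for ps, rs in zip(prior, residual)]


def policy_step(
    visual_tokens: List[Vector],
    priors: WorldActionPriors,
    residual: List[Vector],
) -> Tuple[List[Vector], List[Vector]]:
    """Route both priors into one decision step (toy stand-in for the VLA)."""
    tokens = latent_steering(visual_tokens, priors.scene_latent)
    actions = action_steering(priors.trajectory, residual)
    return tokens, actions


if __name__ == "__main__":
    priors = WorldActionPriors(scene_latent=[0.5, 0.5], trajectory=[[0.0, 1.0]])
    tokens, actions = policy_step([[1.0, 2.0]], priors, residual=[[0.25, -0.5]])
    print(len(tokens), actions)
```

The sketch only shows where each prior enters the decision chain: one on the perception side, one on the action side. A real system would replace both stand-in functions with learned modules.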