World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Abstract

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently on in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics such pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together, the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning. The scene-evolution prior remains effective even when it is supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate in every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project website: https://world-pilot.github.io/
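
To make the two pathways concrete, the following toy sketch traces how a scene-evolution latent and an anticipated trajectory could enter a policy at two different points. It is an illustration under stated assumptions, not the World Pilot implementation: the abstract does not specify how either prior is fused, so the gated additive blend, the per-step convex combination, and every name here (`WamPriors`, `latent_steering`, `action_steering`, `gate`, `weight`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class WamPriors:
    """Hypothetical container for the two priors a World-Action Model provides."""
    scene_latent: Vector       # scene-evolution latent, used by Latent Steering
    trajectory: List[Vector]   # anticipated trajectory, used by Action Steering


def latent_steering(visual_features: Vector, scene_latent: Vector,
                    gate: float = 0.5) -> Vector:
    """Condition perception features on the scene-evolution latent.

    Assumption: a gated additive blend stands in for the real conditioning.
    """
    if len(visual_features) != len(scene_latent):
        raise ValueError("feature and latent dimensions must match")
    return [f + gate * z for f, z in zip(visual_features, scene_latent)]


def action_steering(proposed: List[Vector], prior: List[Vector],
                    weight: float = 0.3) -> List[Vector]:
    """Pull the action generator's proposal toward the anticipated trajectory.

    Assumption: a per-step convex combination stands in for the real motion prior.
    """
    if len(proposed) != len(prior):
        raise ValueError("action horizons must match")
    return [
        [(1.0 - weight) * a + weight * p for a, p in zip(step_a, step_p)]
        for step_a, step_p in zip(proposed, prior)
    ]


# Toy usage: two-dimensional features and a two-step action horizon.
priors = WamPriors(scene_latent=[0.2, -0.4],
                   trajectory=[[1.0, 0.0], [1.0, 1.0]])
steered_features = latent_steering([1.0, 1.0], priors.scene_latent)
steered_actions = action_steering([[0.0, 0.0], [0.0, 0.0]], priors.trajectory)
```

The two functions are kept separate to mirror the abstract's claim that the priors act at different stages: one on perception, one on action generation. Under this reading, a video-pretrained world model without action post-training could still supply `scene_latent` while `trajectory` is unavailable.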