World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Abstract

We propose world-language-action (WLA) models, a new class of embodied foundation models. A WLA model takes textual instructions, images, and robot states as inputs and jointly predicts textual subtasks, subgoal images, and robot actions. It combines the world modeling interface of world-action models (WAMs), which allows learning from extensive egocentric videos, with the language reasoning capacity of vision-language-action (VLA) models, which is needed to solve complex long-horizon tasks. At the core of WLA is an autoregressive (AR) Transformer backbone, rather than the bidirectional diffusion Transformer used in WAMs, that predicts the next state, which comprises a semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by a world modeling objective through a dedicated World Expert and are used to make the state-action correlation easier for the Action Expert to characterize. WLA uses meta-queries so that world prediction influences action generation only implicitly, which allows world prediction to be disabled during inference. World prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, runs at 40 ms per inference on an NVIDIA RTX 5090. Evaluations in simulated and real-world environments show that WLA-0 achieves state-of-the-art multi-task and long-horizon performance, e.g., a 92.94% success rate on RoboTwin2.0 Clean and a 56.5% success rate on RMBench. WLA-0 also shows promise for learning novel tasks directly from cross-embodiment robot videos without action annotations.
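
The data flow described above, in which a shared backbone produces meta-query features, the Action Expert always consumes them, and the World Expert decodes them only when requested, can be illustrated with a toy sketch. This is not the WLA-0 implementation: every function name, signature, and arithmetic rule below is a hypothetical placeholder, and images and states are stood in for by short lists of floats. It is meant only to show why the action output does not change when world prediction is switched off.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vector = List[float]  # stand-in for image features or robot state (assumption)


@dataclass
class WLAOutput:
    subtask: str               # semantic-level textual intention
    action: Vector             # robot action produced by the Action Expert
    subgoal: Optional[Vector]  # subgoal-image stand-in; None if world prediction is off


def backbone(instruction: str, image: Vector, state: Vector,
             n_queries: int = 4) -> Tuple[str, Vector]:
    """Toy stand-in for the AR backbone: returns a subtask and meta-query features."""
    subtask = f"step toward: {instruction}"
    context = sum(image) + sum(state)
    # One scalar feature per meta-query; placeholder arithmetic only.
    queries = [context * (i + 1) / n_queries for i in range(n_queries)]
    return subtask, queries


def action_expert(queries: Vector, state: Vector) -> Vector:
    """Toy Action Expert: reads only the meta-query features and the robot state."""
    gain = sum(queries) / len(queries)
    return [s + 0.1 * gain for s in state]


def world_expert(queries: Vector, image: Vector) -> Vector:
    """Toy World Expert: decodes a subgoal from the same meta-query features."""
    shift = queries[-1]
    return [p + 0.01 * shift for p in image]


def wla_step(instruction: str, image: Vector, state: Vector,
             predict_world: bool = False) -> WLAOutput:
    subtask, queries = backbone(instruction, image, state)
    action = action_expert(queries, state)
    # The world branch is optional: the action never reads its output.
    subgoal = world_expert(queries, image) if predict_world else None
    return WLAOutput(subtask, action, subgoal)


fast = wla_step("stack the blocks", [0.1, 0.2], [0.0, 1.0])
full = wla_step("stack the blocks", [0.1, 0.2], [0.0, 1.0], predict_world=True)
```

In this sketch `fast` and `full` carry identical actions, and only `full` carries a subgoal. How a predicted subgoal would then be used for test-time scaling is not specified in the abstract, so the sketch stops at producing it.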