World-Language-Action Model for Integrated World Modeling, Language Reasoning, and Action Synthesis

Abstract

We propose world-language-action (WLA) models as a new class of embodied foundation models. A WLA model takes textual instructions, images, and robot states as input and jointly predicts textual subtasks, subgoal images, and robot actions. It thereby combines two capabilities: the world-modeling interface of world-action models (WAMs), which allows learning from extensive egocentric videos, and the language reasoning of vision-language-action (VLA) models, which is needed to solve complex long-horizon tasks. At the core of WLA lies an autoregressive (AR) Transformer backbone, rather than the bidirectional diffusion Transformer used in WAMs. The backbone predicts the next state, which comprises a semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by a world-modeling objective through a dedicated World Expert, and they simplify the state-action correlation that the Action Expert has to characterize. WLA uses meta-queries so that world prediction influences action generation only implicitly, which allows world prediction to be disabled during inference. World prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, runs at 40 ms per inference on an NVIDIA RTX 5090. Evaluations in simulated and real-world environments show that WLA-0 achieves state-of-the-art multi-task and long-horizon learning ability, e.g., a 92.94% success rate on RoboTwin2.0 Clean and a 56.5% success rate on RMBench. WLA-0 also shows promise for learning novel tasks directly from cross-embodiment robot videos without action annotations.
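The two inference modes mentioned above (world prediction disabled for speed, or activated for test-time scaling) can be illustrated with a toy sketch. The abstract does not say how test-time scaling is carried out, so the best-of-N selection below is only one plausible reading, stated here as an assumption. Every name in it (`Prediction`, `toy_policy`, `act`, `num_candidates`) is hypothetical, and the noisy 2-D policy is a stand-in rather than the WLA-0 architecture.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

Vec = List[float]


@dataclass
class Prediction:
    subtask: str            # textual subtask
    action: Vec             # robot action (toy: a 2-D displacement)
    subgoal: Optional[Vec]  # predicted next state; None if world prediction is off


def toy_policy(state: Vec, goal: Vec, rng: random.Random,
               predict_world: bool) -> Prediction:
    """Hypothetical stand-in for a single forward pass (not the real model)."""
    # A noisy step toward the goal plays the role of the action output.
    action = [0.5 * (g - s) + rng.gauss(0.0, 0.2) for s, g in zip(state, goal)]
    # The world-prediction branch only runs when explicitly requested.
    subgoal = [s + a for s, a in zip(state, action)] if predict_world else None
    return Prediction("move toward target", action, subgoal)


def sq_dist(a: Vec, b: Vec) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def act(state: Vec, goal: Vec, rng: random.Random,
        num_candidates: int = 1) -> Prediction:
    """Fast path for num_candidates == 1; assumed best-of-N selection otherwise."""
    if num_candidates == 1:
        # World prediction disabled: one pass, no predicted subgoal.
        return toy_policy(state, goal, rng, predict_world=False)
    # World prediction enabled: sample several candidates and keep the one
    # whose predicted subgoal lands closest to the goal.
    candidates = [toy_policy(state, goal, rng, predict_world=True)
                  for _ in range(num_candidates)]
    return min(candidates, key=lambda p: sq_dist(p.subgoal, goal))


if __name__ == "__main__":
    fast = act([0.0, 0.0], [1.0, 1.0], random.Random(0))
    best = act([0.0, 0.0], [1.0, 1.0], random.Random(0), num_candidates=8)
    print(fast.subgoal is None, best.subgoal is not None)
```

In this sketch, spending more compute (a larger `num_candidates`) can only keep or improve the score of the selected candidate under the toy scoring rule, which is the general idea behind test-time scaling. How WLA-0 scores or selects candidates in practice is not specified in the abstract.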