World-Language-Action Models for Unified World Modeling, Language Reasoning, and Action Synthesis

Abstract

We propose world-language-action (WLA) models as a new class of embodied foundation models. A WLA model takes textual instructions, images, and robot states as input and jointly predicts textual subtasks, subgoal images, and robot actions. It combines the world-modeling interface of world-action models (WAMs), which enables learning from extensive egocentric videos, with the language reasoning capacity of vision-language-action (VLA) models, which enables solving complex long-horizon tasks. At the core of WLA lies an autoregressive (AR) Transformer backbone, rather than the bidirectional diffusion Transformer used in WAMs, that predicts the next state, comprising a semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by a world-modeling objective based on a dedicated World Expert and are used to ease the characterization of the state-action correlation for the Action Expert. WLA uses meta-queries so that the world prediction implicitly influences action generation, which allows world prediction to be disabled during inference. World prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, runs at 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments show that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., a 92.94% success rate on RoboTwin2.0 Clean and a 56.5% success rate on RMBench. WLA-0 also shows promise for learning novel tasks directly from cross-embodiment robot videos without action annotations.
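
The input/output contract described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the names `Observation`, `Prediction`, `ToyWLAPolicy`, `predict_world`, and `control_rate_hz`, the list-based image placeholders, and the action-chunk length are all hypothetical, and the "policy" is a stub with no learned weights. It only shows the three inputs, the three outputs, the option to skip world prediction at inference time, and the arithmetic behind the 40 ms latency figure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    """Inputs named in the abstract: instruction, images, robot state."""
    instruction: str
    images: List[List[float]]   # placeholder: one flat pixel list per camera
    robot_state: List[float]    # placeholder: e.g. joint positions


@dataclass
class Prediction:
    """Outputs named in the abstract; the subgoal image is optional."""
    subtask: str
    actions: List[List[float]]  # a chunk of future actions
    subgoal_image: Optional[List[float]] = None


class ToyWLAPolicy:
    """Hypothetical stand-in for a WLA model (stub logic, no learned weights)."""

    def __init__(self, horizon: int = 4) -> None:
        # Assumed action-chunk length; this value is not taken from the paper.
        self.horizon = horizon

    def predict(self, obs: Observation, predict_world: bool = False) -> Prediction:
        # Stub for language reasoning: echo the instruction as the subtask.
        subtask = f"next step for: {obs.instruction}"
        # Stub for the action head: hold the current state for `horizon` steps.
        actions = [list(obs.robot_state) for _ in range(self.horizon)]
        # World prediction can be switched off at inference, as the abstract says.
        subgoal = list(obs.images[0]) if predict_world else None
        return Prediction(subtask, actions, subgoal)


def control_rate_hz(latency_ms: float) -> float:
    """Upper bound on inference calls per second for a given latency."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    obs = Observation("put the cup on the shelf", [[0.0, 0.5]], [0.1, 0.2, 0.3])
    policy = ToyWLAPolicy()
    print(policy.predict(obs).subgoal_image)  # world prediction disabled
    print(policy.predict(obs, predict_world=True).subgoal_image)
    print(control_rate_hz(40.0))  # 40 ms per inference -> 25.0 calls per second
```

Under the reported 40 ms latency, the model could be queried at most 25 times per second; if each query returns a chunk of several actions, the effective control rate may be higher, though the abstract does not specify the chunk length.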