World-Language-Action Models for Unified World Modeling, Language Reasoning, and Action Synthesis

Abstract

We propose world-language-action (WLA) models as a new class of embodied foundation models. A WLA model takes textual instructions, images, and robot states as input and jointly predicts textual subtasks, subgoal images, and robot actions. It thereby combines two capabilities: the world modeling interface of world-action models (WAMs), which allows learning from large-scale egocentric video, and the language reasoning of vision-language-action (VLA) models, which helps solve complex long-horizon tasks. At the core of WLA is an autoregressive (AR) Transformer backbone, rather than the bidirectional diffusion Transformer used in WAMs, that predicts the next state, comprising a semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by a world modeling objective through a dedicated World Expert, and they simplify the Action Expert's task of characterizing the state-action correlation. WLA uses meta-queries so that world prediction influences action generation only implicitly, which allows world prediction to be disabled during inference. World prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, takes 40 ms per inference on an NVIDIA RTX 5090. Evaluations in simulated and real-world environments show that WLA-0 achieves state-of-the-art multi-task and long-horizon learning performance, e.g., a 92.94% success rate on RoboTwin2.0 Clean and a 56.5% success rate on RMBench. WLA-0 also shows promise for learning novel tasks directly from cross-embodiment robot videos without action annotations.
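
The data flow described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the class `ToyWLA`, its methods, the vector sizes, and the arithmetic are hypothetical stand-ins, not the WLA-0 architecture or its API. The only point it tries to show is the role of the meta-queries: the Action Expert reads the backbone's meta-query states, so the World Expert can be skipped at inference without changing the predicted actions. How an activated world prediction is used for test-time scaling is not modeled here.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vector = List[float]


@dataclass
class StepOutput:
    subtask: str                 # textual subtask (semantic-level intention)
    actions: List[Vector]        # short chunk of robot actions
    subgoal: Optional[Vector]    # toy "subgoal image"; None if world prediction is off


class ToyWLA:
    """Hypothetical toy model of the WLA data flow; not the real WLA-0."""

    def __init__(self, dim: int = 4, num_queries: int = 2,
                 horizon: int = 3, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.horizon = horizon
        # Stand-ins for learned meta-query embeddings.
        self.meta_queries = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                             for _ in range(num_queries)]

    def backbone(self, instruction: str, image: Vector,
                 state: Vector) -> Tuple[str, List[Vector]]:
        # Stand-in for the AR backbone: emits a textual subtask and
        # one hidden vector per meta-query, conditioned on the inputs.
        features = image + state
        context = sum(features) / len(features)
        subtask = f"next step for: {instruction}"
        hidden = [[w * context for w in query] for query in self.meta_queries]
        return subtask, hidden

    @staticmethod
    def _pool(hidden: List[Vector]) -> Vector:
        # Average the meta-query states dimension by dimension.
        return [sum(column) / len(hidden) for column in zip(*hidden)]

    def world_expert(self, hidden: List[Vector]) -> Vector:
        # Stand-in for decoding a subgoal image from the meta-query states.
        return self._pool(hidden)

    def action_expert(self, hidden: List[Vector]) -> List[Vector]:
        # Stand-in for generating an action chunk from the same states.
        pooled = self._pool(hidden)
        return [[(t + 1) * v for v in pooled] for t in range(self.horizon)]

    def step(self, instruction: str, image: Vector, state: Vector,
             predict_world: bool = False) -> StepOutput:
        subtask, hidden = self.backbone(instruction, image, state)
        # World prediction is optional; actions never read its output.
        subgoal = self.world_expert(hidden) if predict_world else None
        actions = self.action_expert(hidden)
        return StepOutput(subtask=subtask, actions=actions, subgoal=subgoal)


if __name__ == "__main__":
    model = ToyWLA()
    fast = model.step("stack the blocks", [0.1, 0.2], [0.3])
    full = model.step("stack the blocks", [0.1, 0.2], [0.3], predict_world=True)
    print(fast.subgoal is None, fast.actions == full.actions)
```

In this toy, calling `step` with `predict_world=False` skips the World Expert entirely, which mirrors the abstract's claim that world prediction can be disabled during inference; in the real model the benefit would come from training with the world modeling objective, which a forward-only sketch cannot show.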