World-Language-Action Models for Unified World Modeling, Language Reasoning, and Action Synthesis

Abstract

We propose world-language-action (WLA) models as a new class of embodied foundation models. A WLA model takes textual instructions, images, and robot states as inputs and jointly predicts textual subtasks, subgoal images, and robot actions. It thereby combines two capabilities: the world modeling interface of world-action models (WAMs), which allows learning from extensive egocentric videos, and the language reasoning capacity of vision-language-action (VLA) models, which allows solving complex long-horizon tasks. At the core of WLA lies an autoregressive (AR) Transformer backbone, rather than the bidirectional diffusion Transformer used in WAMs, that predicts the next state. This state comprises a semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by a world modeling objective based on a dedicated World Expert, and they ease the characterization of the state-action correlation for the Action Expert. WLA uses meta-queries so that the world prediction influences action generation only implicitly, which allows the world prediction to be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., a 92.94% success rate on RoboTwin2.0 Clean and a 56.5% success rate on RMBench. WLA-0 also shows promise for learning novel tasks directly from cross-embodiment robot videos without action annotations.
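
The input/output contract described above, including the optional world prediction, can be illustrated with a toy sketch. This is not the WLA-0 implementation: every name here (`Observation`, `Prediction`, `toy_backbone`, `toy_action_expert`, `toy_world_expert`, `predict`, `use_world_prediction`) is hypothetical, and the arithmetic inside each function is a placeholder for learned networks. The only property it is meant to show is that the action is computed from shared meta-query features, so skipping the subgoal image at inference leaves the action unchanged.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Observation:
    instruction: str          # textual instruction
    image: List[float]        # flattened stand-in for a camera image
    robot_state: List[float]  # e.g. joint positions


@dataclass
class Prediction:
    subtask: str                          # textual subtask (semantic intention)
    action: List[float]                   # robot action
    subgoal_image: Optional[List[float]]  # None when world prediction is off


def toy_backbone(obs: Observation) -> Tuple[str, List[float]]:
    """Placeholder for the AR backbone: returns a subtask string and
    meta-query features (hypothetical; a real model would learn these)."""
    subtask = f"next step for: {obs.instruction}"
    mean_pixel = sum(obs.image) / len(obs.image)
    meta_features = [mean_pixel + s for s in obs.robot_state]
    return subtask, meta_features


def toy_action_expert(meta_features: List[float]) -> List[float]:
    """Placeholder Action Expert: maps meta-query features to an action."""
    return [0.1 * f for f in meta_features]


def toy_world_expert(meta_features: List[float], image: List[float]) -> List[float]:
    """Placeholder World Expert: decodes a subgoal image from the same features."""
    shift = sum(meta_features) / len(meta_features)
    return [p + shift for p in image]


def predict(obs: Observation, use_world_prediction: bool = False) -> Prediction:
    subtask, meta = toy_backbone(obs)
    action = toy_action_expert(meta)
    # The world branch is optional: the action never reads the decoded image.
    subgoal = toy_world_expert(meta, obs.image) if use_world_prediction else None
    return Prediction(subtask, action, subgoal)


obs = Observation("stack the cups", image=[0.0, 1.0], robot_state=[0.5, -0.5])
fast = predict(obs)                             # world prediction disabled
full = predict(obs, use_world_prediction=True)  # world prediction enabled
```

In this sketch `fast.action` equals `full.action`, which mirrors the claim that the world prediction can be dropped at inference time. How WLA uses an activated world prediction for test-time scaling is not specified in the abstract and is not modeled here.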