World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Abstract

We propose world-language-action (WLA) models as a new class of embodied foundation models. A WLA model takes textual instructions, images, and robot states as inputs and jointly predicts textual subtasks, subgoal images, and robot actions. It thereby combines the world modeling interface of world-action models (WAMs), which allows learning from extensive egocentric videos, with the language reasoning capacity of vision-language-action (VLA) models, which is needed to solve complex long-horizon tasks. At the core of WLA lies an autoregressive (AR) Transformer backbone, rather than the bidirectional diffusion Transformer used in WAMs, that predicts the next state, which comprises a semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by a world modeling objective based on a dedicated World Expert, and they are used to make the state-action correlation easier for the Action Expert to characterize. WLA uses meta-queries so that the world prediction influences action generation implicitly and can therefore be disabled during inference. The world prediction can also be kept active to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., a 92.94% success rate on RoboTwin2.0 Clean and a 56.5% success rate on RMBench. WLA-0 also shows promise for learning novel tasks directly from cross-embodiment robot videos without action annotations.
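
The input/output interface described in the abstract can be illustrated with a toy data-flow sketch. This is not the WLA-0 implementation: every name here (`WLAOutput`, `backbone`, `world_expert`, `action_expert`, `wla_step`, `predict_world`) is hypothetical, and the arithmetic inside each function is a placeholder for a learned network. The sketch only shows one plausible reading of the design, in which the Action Expert consumes shared meta-query features rather than the decoded subgoal image, so the World Expert can be skipped at inference. It does not model how activating world prediction enables test-time scaling.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class WLAOutput:
    subtask: str                          # textual subtask (semantic-level intention)
    actions: List[float]                  # robot actions
    subgoal_image: Optional[List[float]]  # None when world prediction is disabled


def backbone(instruction: str, image: List[float], state: List[float]) -> Tuple[str, List[float]]:
    """Placeholder for the AR backbone: returns a subtask and meta-query features."""
    subtask = f"next step for: {instruction}"
    meta = [sum(image) / len(image), sum(state) / len(state)]  # toy features
    return subtask, meta


def world_expert(meta: List[float]) -> List[float]:
    """Placeholder for the World Expert: decodes a subgoal 'image' from meta features."""
    return [0.5 * m for m in meta]


def action_expert(meta: List[float], state: List[float]) -> List[float]:
    """Placeholder for the Action Expert: conditions on meta features, not on the subgoal."""
    return [s + 0.1 * meta[0] for s in state]


def wla_step(instruction: str, image: List[float], state: List[float],
             predict_world: bool = False) -> WLAOutput:
    subtask, meta = backbone(instruction, image, state)
    # The World Expert is optional at inference in this sketch.
    subgoal = world_expert(meta) if predict_world else None
    actions = action_expert(meta, state)
    return WLAOutput(subtask=subtask, actions=actions, subgoal_image=subgoal)


if __name__ == "__main__":
    out = wla_step("stack the cups", image=[0.2, 0.4], state=[1.0, 2.0])
    print(out.subtask, out.actions, out.subgoal_image)
```

In this toy version the actions are identical whether or not `predict_world` is set, which reflects only the "can be disabled" half of the abstract's claim. Any benefit from keeping world prediction active would require a mechanism that the abstract does not specify.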