Robots Need More Than VLAs and World Models

Abstract

Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. In this position paper, we argue that this framing is incomplete. The central bottleneck is not only policy learning, but the absence of mechanisms that convert the world's abundant unstructured behavioural data into grounded robot supervision. Human motion, internet video, simulation rollouts, and interactive demonstrations contain rich information about tasks, goals, contacts, failures, and physical constraints, yet most of this information is not directly usable by robot policies because it lacks embodiment-specific action labels, task semantics, and reward structure. We identify four missing components for the next generation of robotics: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress and success from video and language. We survey recent progress in robot foundation models, cross-embodiment datasets, learning from video, world models, and reward modelling, and propose a research agenda for building robotics systems that can learn not only from robot demonstrations, but from the broader physical world.