μ_0: A Scalable 3D Interaction Trace World Model

Abstract

World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but spend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present μ_0, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, μ_0 forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. We pretrain μ_0 on this TraceExtract supervision by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that μ_0 outperforms baselines, including trace prediction models and tokenized VLM methods, in both 2D and 3D trace prediction. Because μ_0 is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as π_0. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.
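To make the B-spline trace representation concrete, the sketch below decodes a handful of 3D control points into a dense, smooth trajectory. It is illustrative only: the abstract does not state the spline degree, knot vector, or number of control points, so a uniform cubic B-spline is assumed here, and the names `bspline_weights` and `decode_trace` are hypothetical rather than taken from μ_0.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def bspline_weights(u: float) -> Tuple[float, float, float, float]:
    """Uniform cubic B-spline blending weights for u in [0, 1).

    The four weights are non-negative and sum to 1, so each sample is a
    convex combination of four consecutive control points.
    """
    return (
        (1 - u) ** 3 / 6.0,
        (3 * u**3 - 6 * u**2 + 4) / 6.0,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0,
        u**3 / 6.0,
    )


def decode_trace(
    control_points: Sequence[Sequence[float]],
    samples_per_segment: int = 8,
) -> List[Point3]:
    """Sample a uniform cubic B-spline defined by 3D control points.

    Assumption (not from the paper): uniform knots and degree 3.
    K control points give K - 3 segments, each sampled at
    `samples_per_segment` evenly spaced parameter values.
    """
    if len(control_points) < 4:
        raise ValueError("a cubic B-spline needs at least 4 control points")
    trace: List[Point3] = []
    for i in range(len(control_points) - 3):
        window = control_points[i : i + 4]
        for s in range(samples_per_segment):
            w = bspline_weights(s / samples_per_segment)
            # Blend each coordinate (x, y, z) of the four control points.
            x, y, z = (
                sum(w[k] * window[k][axis] for k in range(4))
                for axis in range(3)
            )
            trace.append((x, y, z))
    return trace
```

Under these assumptions, a trace expert would only need to output the few control points per query; the dense trajectory follows from evaluating the spline, and its smoothness comes from the basis functions rather than from per-timestep predictions.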