μ_0: A Scalable 3D Interaction Trace World Model

Abstract

World models that capture how actions induce physical change enable scalable robot learning without relying on embodiment-specific action labels. Pixel-space video models provide broad visual priors but spend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present μ_0, a scalable world model based on 3D traces. Rather than predicting dense pixels or modeling actions directly, μ_0 forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. We pretrain μ_0 on this TraceExtract supervision. The model combines a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that μ_0 outperforms baselines, including trace prediction models and tokenized VLM methods, in both 2D and 3D trace prediction. Because μ_0 can be frozen and reused, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as π_0. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.
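To illustrate why B-spline control points give a compact representation of a smooth 3D trace, the following is a minimal sketch, not the paper's implementation. The abstract does not state the spline degree, knot layout, or number of control points, so the uniform cubic basis, the names `cubic_bspline_basis` and `sample_trace`, and the `samples_per_segment` parameter are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def cubic_bspline_basis(t: float) -> Tuple[float, float, float, float]:
    """Uniform cubic B-spline weights for a local parameter t in [0, 1].

    Assumption: a uniform cubic basis; the source does not specify the degree.
    The four weights always sum to 1.
    """
    t2, t3 = t * t, t * t * t
    return (
        (1 - 3 * t + 3 * t2 - t3) / 6.0,
        (4 - 6 * t2 + 3 * t3) / 6.0,
        (1 + 3 * t + 3 * t2 - 3 * t3) / 6.0,
        t3 / 6.0,
    )


def sample_trace(
    control_points: Sequence[Point3], samples_per_segment: int = 8
) -> List[Point3]:
    """Decode a dense 3D trace from a short list of control points.

    Each run of four consecutive control points defines one curve segment.
    Returns (len(control_points) - 3) * samples_per_segment + 1 points.
    """
    if len(control_points) < 4:
        raise ValueError("a cubic B-spline needs at least 4 control points")
    if samples_per_segment < 1:
        raise ValueError("samples_per_segment must be at least 1")

    n_segments = len(control_points) - 3
    trace: List[Point3] = []
    for seg in range(n_segments):
        window = control_points[seg:seg + 4]
        # Include the endpoint t = 1 only on the final segment to avoid
        # duplicating the point shared by neighbouring segments.
        is_last = seg == n_segments - 1
        steps = samples_per_segment + 1 if is_last else samples_per_segment
        for k in range(steps):
            w = cubic_bspline_basis(k / samples_per_segment)
            x, y, z = (
                sum(w[j] * window[j][axis] for j in range(4))
                for axis in range(3)
            )
            trace.append((x, y, z))
    return trace


if __name__ == "__main__":
    # Hypothetical control points for one query point (e.g. a fingertip).
    ctrl = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (0.2, 0.1, 0.2),
            (0.3, 0.1, 0.2), (0.4, 0.0, 0.1), (0.5, 0.0, 0.0)]
    dense = sample_trace(ctrl)
    print(len(ctrl), "control points ->", len(dense), "trace points")
```

Under these assumptions, a predictor only needs to output a handful of control points per query, and the decoded trajectory is smooth by construction, which is one plausible reading of the "smooth 3D trajectories" described in the abstract.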