μ_0: A Scalable World Model of 3D Interaction Traces

Abstract

World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present μ_0, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, μ_0 forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. We pretrain μ_0 on this TraceExtract supervision. The model combines a pretrained vision-language backbone with a modular trace expert that represents each query as B-spline control points and predicts future traces. Experiments show that μ_0 outperforms baselines, including trace prediction models and tokenized VLM methods, in both 2D and 3D trace prediction. Because μ_0 can be frozen and reused, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as π_0. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.
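
To make the B-spline trace representation concrete, the following is a minimal sketch of how a smooth 3D trajectory could be decoded from a handful of control points. The abstract does not specify the spline degree, the knot vector, or the number of control points, so the cubic degree, the clamped uniform knots, and all function names here (`clamped_knots`, `basis`, `eval_trace`, `sample_trace`) are illustrative assumptions and not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def clamped_knots(n_ctrl: int, degree: int) -> List[float]:
    """Clamped uniform knot vector on [0, 1] (an assumed choice)."""
    if n_ctrl <= degree:
        raise ValueError("need more control points than the spline degree")
    n_inner = n_ctrl - degree - 1
    inner = [(i + 1) / (n_inner + 1) for i in range(n_inner)]
    return [0.0] * (degree + 1) + inner + [1.0] * (degree + 1)


def basis(i: int, k: int, t: float, knots: Sequence[float]) -> float:
    """Cox-de Boor recursion for the i-th B-spline basis function of degree k."""
    if k == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left_den = knots[i + k] - knots[i]
    right_den = knots[i + k + 1] - knots[i + 1]
    left = 0.0
    right = 0.0
    # Terms with a zero-width knot span are defined to be zero.
    if left_den > 0:
        left = (t - knots[i]) / left_den * basis(i, k - 1, t, knots)
    if right_den > 0:
        right = (knots[i + k + 1] - t) / right_den * basis(i + 1, k - 1, t, knots)
    return left + right


def eval_trace(ctrl: Sequence[Point3], t: float, degree: int = 3) -> Point3:
    """Evaluate the 3D trace at parameter t in [0, 1]."""
    # A clamped spline starts and ends exactly at its first and last control points.
    if t <= 0.0:
        return tuple(float(c) for c in ctrl[0])
    if t >= 1.0:
        return tuple(float(c) for c in ctrl[-1])
    knots = clamped_knots(len(ctrl), degree)
    weights = [basis(i, degree, t, knots) for i in range(len(ctrl))]
    return tuple(sum(w * p[d] for w, p in zip(weights, ctrl)) for d in range(3))


def sample_trace(ctrl: Sequence[Point3], n_samples: int, degree: int = 3) -> List[Point3]:
    """Sample the trace at n_samples evenly spaced parameter values."""
    if n_samples < 2:
        raise ValueError("n_samples must be at least 2")
    return [eval_trace(ctrl, s / (n_samples - 1), degree) for s in range(n_samples)]


if __name__ == "__main__":
    # Hypothetical control points for one query point, in metres.
    control_points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.05), (0.2, 0.1, 0.1),
                      (0.3, 0.1, 0.1), (0.4, 0.0, 0.05), (0.5, 0.0, 0.0)]
    for point in sample_trace(control_points, n_samples=5):
        print(tuple(round(c, 4) for c in point))
```

Under these assumptions, a model that outputs a few control points per query yields a dense trajectory that is smooth by construction, which is one plausible reason to predict control points instead of every waypoint.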