μ_0: A Scalable 3D Interaction Trace World Model

Abstract

World models that capture how actions induce physical change enable scalable robot learning without relying on embodiment-specific action labels. Pixel-space video models provide broad visual priors but spend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels, which hinders scalability. We present μ_0, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, μ_0 forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. We pretrain μ_0 on this TraceExtract supervision by combining a pretrained vision-language backbone with a modular trace expert, which represents each query point via B-spline control points and predicts its future trace. Experiments show that μ_0 outperforms baselines, including trace prediction models and tokenized VLM methods, in both 2D and 3D trace prediction. Because μ_0 can be frozen and reused, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies perform competitively with VLA models pretrained with action supervision, such as π_0. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.
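To make the B-spline trace representation concrete, the sketch below decodes a handful of 3D control points into a smooth, densely sampled trajectory. The abstract does not state the spline degree, knot vector, or number of control points, so the uniform cubic basis used here is an assumption, and the names `cubic_bspline_point` and `decode_trace` are hypothetical helpers for illustration, not part of μ_0.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def cubic_bspline_point(p0: Point3, p1: Point3, p2: Point3, p3: Point3,
                        u: float) -> Point3:
    """Evaluate one uniform cubic B-spline segment at u in [0, 1].

    Assumption: uniform cubic basis; the paper's actual degree and
    knot layout are not given in the abstract.
    """
    b0 = (1.0 - u) ** 3 / 6.0
    b1 = (3.0 * u ** 3 - 6.0 * u ** 2 + 4.0) / 6.0
    b2 = (-3.0 * u ** 3 + 3.0 * u ** 2 + 3.0 * u + 1.0) / 6.0
    b3 = u ** 3 / 6.0
    # The four weights sum to 1, so the point is a convex combination
    # of the four control points.
    return tuple(
        b0 * a + b1 * b + b2 * c + b3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def decode_trace(control_points: Sequence[Point3],
                 samples_per_segment: int = 8) -> List[Point3]:
    """Turn control points for one query point into a sampled 3D trace."""
    if len(control_points) < 4:
        raise ValueError("a cubic B-spline needs at least 4 control points")
    trace: List[Point3] = []
    # Each sliding window of 4 control points defines one curve segment.
    for start in range(len(control_points) - 3):
        window = control_points[start:start + 4]
        for k in range(samples_per_segment):
            trace.append(cubic_bspline_point(*window, k / samples_per_segment))
    # Close the curve with the end point of the last segment.
    trace.append(cubic_bspline_point(*control_points[-4:], 1.0))
    return trace


if __name__ == "__main__":
    # Hypothetical control points for a single query point (metres).
    ctrl = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.05), (0.2, 0.05, 0.1),
            (0.3, 0.1, 0.1), (0.4, 0.1, 0.05)]
    for x, y, z in decode_trace(ctrl, samples_per_segment=4):
        print(f"{x:.3f} {y:.3f} {z:.3f}")
```

Under this assumption, a predictor only has to output a few control points per query point, and smoothness follows from the basis functions rather than from any constraint on the predicted values.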