UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

April 21, 2026
Authors: Boyu Chen, Yi Chen, Lu Qiu, Jerry Bai, Yuying Ge, Yixiao Ge
cs.AI

Abstract

Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both humanoid simulation benchmarks and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By conditioning on unified tokens to align cross-embodiment dynamics, it realizes direct human-to-humanoid action transfer, ensuring that human data translates seamlessly into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations showing human and humanoid features converging onto a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
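
To make the tri-branch cross-reconstruction idea concrete, here is a minimal sketch of how such a tokenizer could be wired up. This is an illustration under stated assumptions, not the authors' implementation: the linear encoders, the VQ-style codebook with a straight-through estimator, and all names and dimensions (`TriBranchTokenizer`, `vis_dim`, `act_dim`, etc.) are hypothetical, and codebook/commitment losses and richer backbones are omitted for brevity.

```python
# Minimal sketch of a tri-branch cross-reconstruction tokenizer.
# All module names, dimensions, and loss terms are illustrative
# assumptions -- NOT the paper's released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriBranchTokenizer(nn.Module):
    def __init__(self, vis_dim=512, act_dim=64, latent_dim=256, codebook_size=1024):
        super().__init__()
        self.vis_enc = nn.Linear(vis_dim, latent_dim)        # visual observation encoder
        self.act_enc = nn.Linear(act_dim, latent_dim)        # embodiment-specific action encoder
        self.fusion = nn.Linear(2 * latent_dim, latent_dim)  # fuses the two purified modalities
        self.codebook = nn.Embedding(codebook_size, latent_dim)  # shared discrete latent space
        self.vis_dec = nn.Linear(latent_dim, vis_dim)        # action -> vision branch head
        self.act_dec = nn.Linear(latent_dim, act_dim)        # vision -> action branch head

    def quantize(self, z):
        # Nearest-neighbor lookup into the shared codebook.
        dists = torch.cdist(z, self.codebook.weight)
        idx = dists.argmin(dim=-1)
        z_q = self.codebook(idx)
        # Straight-through estimator so gradients flow back to the encoders.
        return z + (z_q - z).detach(), idx

    def forward(self, vis, act):
        h_v, h_a = self.vis_enc(vis), self.act_enc(act)
        # Cross-reconstruction: each modality must explain the other.
        vis_from_act = self.vis_dec(h_a)   # anchor kinematics to visual outcomes
        act_from_vis = self.act_dec(h_v)   # filter out visual confounders via actions
        z_q, idx = self.quantize(self.fusion(torch.cat([h_v, h_a], dim=-1)))
        recon_loss = F.mse_loss(vis_from_act, vis) + F.mse_loss(act_from_vis, act)
        return z_q, idx, recon_loss
```

Under this reading, the discrete indices `idx` would serve as the "unified physical language": a downstream VLA policy could be trained to predict them from observations and language, while a world model could take them as action conditions for video generation, regardless of whether the source trajectory came from a human or a humanoid.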