UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
April 21, 2026
Authors: Boyu Chen, Yi Chen, Lu Qiu, Jerry Bai, Yuying Ge, Yixiao Ge
cs.AI
Abstract
Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both a humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By conditioning on the unified tokens to align cross-embodiment dynamics, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
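To make the tri-branch cross-reconstruction idea concrete, below is a minimal PyTorch-style sketch of such a tokenizer: one branch predicts future visual features from actions, one reconstructs actions from vision, and a fusion branch quantizes both into discrete tokens. All module names, dimensions, and loss weights are hypothetical illustrations of the mechanism described in the abstract, not the authors' released implementation.

```python
# Hypothetical sketch of a tri-branch cross-reconstruction tokenizer.
# Architecture details (layer sizes, codebook size, loss weighting) are
# assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriBranchTokenizer(nn.Module):
    def __init__(self, vis_dim=512, act_dim=64, latent_dim=256, codebook_size=1024):
        super().__init__()
        # Branch 1: actions predict future vision, anchoring kinematics
        # to their physical (visual) outcomes.
        self.action_to_vision = nn.Sequential(
            nn.Linear(act_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, vis_dim))
        # Branch 2: vision reconstructs actions, filtering out visual
        # confounders that carry no kinematic information.
        self.vision_to_action = nn.Sequential(
            nn.Linear(vis_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, act_dim))
        # Branch 3: fusion into a shared discrete latent space via
        # vector quantization against a learned codebook.
        self.fuse = nn.Linear(vis_dim + act_dim, latent_dim)
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def quantize(self, z):
        # Nearest-codebook-entry lookup with a straight-through estimator.
        dists = torch.cdist(z, self.codebook.weight)   # (B, codebook_size)
        token_ids = dists.argmin(dim=-1)               # discrete token ids
        z_q = self.codebook(token_ids)
        z_q = z + (z_q - z).detach()                   # straight-through gradient
        return z_q, token_ids

    def forward(self, vis_feat, action, next_vis_feat):
        pred_vis = self.action_to_vision(action)
        pred_act = self.vision_to_action(vis_feat)
        z = self.fuse(torch.cat([vis_feat, action], dim=-1))
        z_q, token_ids = self.quantize(z)
        # Two cross-reconstruction losses plus a VQ commitment term.
        loss = (F.mse_loss(pred_vis, next_vis_feat)
                + F.mse_loss(pred_act, action)
                + F.mse_loss(z, z_q.detach()))
        return token_ids, loss
```

Under this reading, the `token_ids` form the shared vocabulary of physical intents: a downstream policy (VLA-UniT) would be trained to predict them, while a world model (WM-UniT) would condition video generation on them, regardless of whether the source trajectory came from a human or a humanoid.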