UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Abstract

Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both a humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data translates seamlessly into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
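To make the tri-branch idea more concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes linear encoders and decoders, a single shared codebook with nearest-neighbor quantization, and mean-squared-error reconstruction losses; none of these details are stated in the abstract, and every name (`quantize`, `W_act_enc`, `codebook`, and so on) is hypothetical. It also assumes that the fusion branch reconstructs both modalities, which the abstract does not specify.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z to its nearest codebook vector (squared L2 distance)."""
    # Pairwise squared distances, shape (batch, num_codes).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = d2.argmin(axis=1)
    return ids, codebook[ids]

def mse(a, b):
    return float(((a - b) ** 2).mean())

rng = np.random.default_rng(0)
batch, dim, num_codes = 4, 8, 16
action_dim, vision_dim = 6, 10  # hypothetical feature sizes

# Shared discrete latent space: one codebook used by all three branches (assumption).
codebook = rng.normal(size=(num_codes, dim))

# Hypothetical linear encoders and decoders standing in for learned networks.
W_act_enc = rng.normal(size=(action_dim, dim))
W_vis_enc = rng.normal(size=(vision_dim, dim))
W_fuse = rng.normal(size=(2 * dim, dim))
W_act_dec = rng.normal(size=(dim, action_dim))
W_vis_dec = rng.normal(size=(dim, vision_dim))

actions = rng.normal(size=(batch, action_dim))  # stand-in for kinematic actions
vision = rng.normal(size=(batch, vision_dim))   # stand-in for visual-change features

# Three branches: action, vision, and fusion of both.
z_act = actions @ W_act_enc
z_vis = vision @ W_vis_enc
z_fuse = np.concatenate([z_act, z_vis], axis=1) @ W_fuse

ids_act, q_act = quantize(z_act, codebook)
ids_vis, q_vis = quantize(z_vis, codebook)
ids_fuse, q_fuse = quantize(z_fuse, codebook)

# Cross-reconstruction: action tokens predict vision, vision tokens reconstruct actions.
loss_act_to_vis = mse(q_act @ W_vis_dec, vision)
loss_vis_to_act = mse(q_vis @ W_act_dec, actions)
# Assumed fusion objective: fused tokens reconstruct both modalities.
loss_fuse = mse(q_fuse @ W_vis_dec, vision) + mse(q_fuse @ W_act_dec, actions)

total_loss = loss_act_to_vis + loss_vis_to_act + loss_fuse
print(ids_fuse.shape, total_loss >= 0.0)
```

In this sketch the integer arrays `ids_act`, `ids_vis`, and `ids_fuse` play the role of the unified tokens: because all branches index the same codebook, a downstream policy or world model could in principle consume tokens from human or humanoid data interchangeably. A trainable version would need gradient-based optimization and a quantization-aware estimator, which are omitted here.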
