UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Abstract

Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. Massive egocentric human data offers a scalable alternative, but kinematic mismatches between humans and humanoids leave a cross-embodiment gap that remains a fundamental challenge. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the premise that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision, which anchors kinematics to physical outcomes, while vision reconstructs actions, which filters out irrelevant visual confounders. Concurrently, a fusion branch combines these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): by predicting these unified tokens, the policy effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both humanoid simulation benchmarks and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): by aligning cross-embodiment dynamics through conditioning on the unified tokens, the world model realizes direct human-to-humanoid action transfer. This alignment lets human data translate directly into improved action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations showing human and humanoid features converging onto a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
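The tri-branch idea can be made concrete with a toy sketch. The code below is not the UniT implementation: every name, dimension, random linear map, and the nearest-neighbor codebook lookup is a hypothetical stand-in chosen for illustration, and nothing is trained. It only shows how an action branch (action to token to visual change), a vision branch (visual change to token to action), and a fusion branch (both to token to both) could share one discrete codebook while each embodiment keeps its own action encoder and decoder.

```python
import random

LATENT_DIM = 3      # hypothetical latent size
VISION_DIM = 5      # hypothetical size of a "visual change" feature
CODEBOOK_SIZE = 8   # hypothetical number of discrete tokens


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def quantize(z, codebook):
    """Return (index, code) of the codebook entry nearest to z (squared L2)."""
    dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
    idx = min(range(len(codebook)), key=dists.__getitem__)
    return idx, codebook[idx]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def random_matrix(rows, cols, rng):
    """Random linear map standing in for a learned encoder or decoder."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def make_params(action_dim, shared, rng):
    """Embodiment-specific action maps plus the shared vision maps."""
    params = dict(shared)
    params["enc_action"] = random_matrix(LATENT_DIM, action_dim, rng)
    params["dec_action"] = random_matrix(action_dim, LATENT_DIM, rng)
    params["enc_fusion"] = random_matrix(LATENT_DIM, action_dim + VISION_DIM, rng)
    return params


def tri_branch_losses(action, visual_change, params, codebook):
    """Compute three toy reconstruction losses for one sample."""
    # Action branch: the action token must predict the visual change.
    a_idx, a_code = quantize(matvec(params["enc_action"], action), codebook)
    loss_a2v = mse(matvec(params["dec_vision"], a_code), visual_change)
    # Vision branch: the vision token must reconstruct the action.
    v_idx, v_code = quantize(matvec(params["enc_vision"], visual_change), codebook)
    loss_v2a = mse(matvec(params["dec_action"], v_code), action)
    # Fusion branch: a token from both inputs must reconstruct both.
    fused_in = action + visual_change  # list concatenation
    f_idx, f_code = quantize(matvec(params["enc_fusion"], fused_in), codebook)
    loss_fusion = (mse(matvec(params["dec_action"], f_code), action)
                   + mse(matvec(params["dec_vision"], f_code), visual_change))
    return {"tokens": (a_idx, v_idx, f_idx),
            "a2v": loss_a2v, "v2a": loss_v2a, "fusion": loss_fusion}


rng = random.Random(0)
codebook = random_matrix(CODEBOOK_SIZE, LATENT_DIM, rng)  # shared by all branches
shared = {"enc_vision": random_matrix(LATENT_DIM, VISION_DIM, rng),
          "dec_vision": random_matrix(VISION_DIM, LATENT_DIM, rng)}
human = make_params(4, shared, rng)      # hypothetical 4-DoF human action
humanoid = make_params(6, shared, rng)   # hypothetical 6-DoF humanoid action

visual_change = [rng.uniform(-1, 1) for _ in range(VISION_DIM)]
out_human = tri_branch_losses([rng.uniform(-1, 1) for _ in range(4)],
                              visual_change, human, codebook)
out_humanoid = tri_branch_losses([rng.uniform(-1, 1) for _ in range(6)],
                                 visual_change, humanoid, codebook)
print(out_human["tokens"], out_humanoid["tokens"])
```

In this sketch the vision encoder and codebook are shared, so the same visual change maps to the same vision-branch token for both embodiments even though their action spaces differ in size. A trained system would additionally need gradients through the quantizer (for example a straight-through estimator), which this sketch omits.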
