OmniHumanoid: Streaming Cross-Embodiment Video Generation with Pair-Free Adaptation

Abstract

Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.
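
The split between a shared motion model and lightweight embodiment-specific adapters can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not specify the adapter form, so a low-rank additive correction is used as a stand-in, and the names `SharedLayerWithAdapters`, `LowRankAdapter`, and `robot_a` are hypothetical. The point is only that supporting a new embodiment adds a small set of parameters and leaves the shared weights untouched.

```python
# Illustrative sketch only: the low-rank adapter form, the class names, and
# the embodiment key are assumptions, not the paper's implementation.

def matvec(matrix, vector):
    """Multiply a row-major matrix (sequence of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LowRankAdapter:
    """Hypothetical per-embodiment correction: delta(x) = up @ (down @ x)."""

    def __init__(self, down, up):
        self.down = down  # shape: rank x d_in
        self.up = up      # shape: d_out x rank

    def __call__(self, x):
        return matvec(self.up, matvec(self.down, x))


class SharedLayerWithAdapters:
    """A shared linear map plus optional embodiment-specific adapters."""

    def __init__(self, shared_weight):
        # Stored as nested tuples so the shared weights cannot be mutated in place.
        self.shared_weight = tuple(tuple(row) for row in shared_weight)
        self.adapters = {}

    def add_embodiment(self, name, adapter):
        # Registering a new embodiment only adds adapter parameters.
        self.adapters[name] = adapter

    def forward(self, x, embodiment=None):
        y = matvec(self.shared_weight, x)
        if embodiment is None:
            return y
        delta = self.adapters[embodiment](x)
        return [yi + di for yi, di in zip(y, delta)]


layer = SharedLayerWithAdapters([[1.0, 0.0], [0.0, 1.0]])
layer.add_embodiment(
    "robot_a", LowRankAdapter(down=[[1.0, 0.0]], up=[[0.0], [2.0]])
)

x = [3.0, 4.0]
shared_out = layer.forward(x)              # shared path only
adapted_out = layer.forward(x, "robot_a")  # shared path plus adapter correction
```

In this toy setup the shared path returns the input unchanged, and the `robot_a` adapter adds a rank-1 correction to the second output coordinate. A real system would apply such corrections inside a video generation backbone and train only the adapter on unpaired videos of the new embodiment.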