OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
May 12, 2026
Authors: Yiren Song, Xiyao Deng, Pei Yang, Yihan Wang, Mike Zheng Shou
cs.AI
Abstract
Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.
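The split between a shared motion model and lightweight embodiment-specific adapters can be illustrated with a toy layer. This is only a sketch under stated assumptions: the abstract does not say what form the adapters take, so a low-rank residual (LoRA-style) update is assumed here, and `SharedMotionLayer`, `add_adapter`, and the embodiment name `"new_robot"` are hypothetical names chosen for illustration, not the authors' implementation.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class SharedMotionLayer:
    """Hypothetical stand-in for one layer of a shared motion model."""

    def __init__(self, dim, rank, seed=0):
        self._rng = random.Random(seed)
        self.dim, self.rank = dim, rank
        # Shared weight: assumed learned from paired videos, then kept frozen.
        self.shared = [[self._rng.gauss(0, 0.1) for _ in range(dim)]
                       for _ in range(dim)]
        self.adapters = {}  # embodiment name -> (down, up) low-rank pair

    def add_adapter(self, name):
        # The down-projection is random; the up-projection starts at zero, so
        # a fresh adapter leaves the shared output unchanged until trained.
        down = [[self._rng.gauss(0, 0.1) for _ in range(self.dim)]
                for _ in range(self.rank)]
        up = [[0.0] * self.rank for _ in range(self.dim)]
        self.adapters[name] = (down, up)

    def forward(self, x, embodiment=None):
        out = matvec(self.shared, x)
        if embodiment is not None:
            down, up = self.adapters[embodiment]
            delta = matvec(up, matvec(down, x))  # low-rank residual
            out = [o + d for o, d in zip(out, delta)]
        return out


layer = SharedMotionLayer(dim=4, rank=2)
x = [1.0, 0.5, -0.5, 2.0]
base = layer.forward(x)

layer.add_adapter("new_robot")
untrained = layer.forward(x, "new_robot")

# Pretend that training on unpaired videos changed one adapter weight only.
layer.adapters["new_robot"][1][0][0] = 0.3
adapted = layer.forward(x, "new_robot")
```

In this toy setup, adding or updating an adapter never touches `layer.shared`, which mirrors the abstract's claim that new embodiments can be supported without retraining the shared motion model. The branch-isolated attention design is not modeled here.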