OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
May 12, 2026
Authors: Yiren Song, Xiyao Deng, Pei Yang, Yihan Wang, Mike Zheng Shou
cs.AI
Abstract
Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.
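The split between a shared motion model and lightweight embodiment-specific adapters can be illustrated with a toy sketch. The abstract does not say what form the adapters take, so this example assumes a LoRA-style low-rank residual added to a frozen layer. The names `SharedMotionLayer`, `EmbodimentAdapter`, and `forward`, and the embodiment keys, are hypothetical and are not taken from the paper.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class SharedMotionLayer:
    """Hypothetical stand-in for one layer of the shared motion model.

    Assumed to be trained once on paired data and then kept frozen.
    """

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.W, x)


class EmbodimentAdapter:
    """Assumed low-rank residual x -> B(Ax), one instance per embodiment.

    B starts at zero, so a freshly created adapter leaves the shared
    layer's output unchanged until it is trained.
    """

    def __init__(self, dim, rank, seed=1):
        rng = random.Random(seed)
        self.A = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))


def forward(shared, adapter, x):
    """Shared output plus the embodiment-specific residual."""
    return [b + d for b, d in zip(shared(x), adapter(x))]


dim, rank = 8, 2
shared = SharedMotionLayer(dim)

# Supporting a new embodiment only adds an adapter; `shared` is untouched.
adapters = {
    "robot_a": EmbodimentAdapter(dim, rank, seed=1),
    "robot_b": EmbodimentAdapter(dim, rank, seed=2),
}

x = [1.0] * dim
shared_params = dim * dim          # weights in the frozen layer
adapter_params = 2 * dim * rank    # weights added per new embodiment
y_new = forward(shared, adapters["robot_a"], x)
```

In this toy setting each new embodiment adds `2 * dim * rank` trainable weights while the `dim * dim` shared weights stay fixed, which mirrors the abstract's claim that adaptation does not require retraining the shared motion model. The branch-isolated attention design is not modeled here.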