X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
December 4, 2025
Authors: Pei Yang, Hai Ci, Yiren Song, Mike Zheng Shou
cs.AI
Abstract
The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, an approach proven effective for policy training. Yet existing methods mainly "overlay" robot arms onto egocentric videos; they cannot handle the complex full-body motions and scene occlusions found in third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we designed a scalable data creation pipeline that turns community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.
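To make the paired video-to-video finetuning setup concrete, the sketch below trains a toy conditional denoiser that maps human-video latents to humanoid-video latents under a diffusion-style noise-prediction loss, conditioned on the source clip. This is only a minimal illustration of the general technique: all module names, shapes, schedules, and hyperparameters here are invented assumptions, and the actual X-Humanoid system (adapted from Wan 2.2) is not reproduced by this code.

```python
import torch
import torch.nn as nn

# Toy stand-in for a video diffusion backbone. The real system adapts
# Wan 2.2, which is vastly larger; shapes and hyperparameters here are
# illustrative assumptions, not the paper's configuration.
class ToyV2VDenoiser(nn.Module):
    def __init__(self, latent_dim=64, hidden=256):
        super().__init__()
        # The source (human) latent is concatenated with the noisy
        # target (humanoid) latent, so the model is conditioned on the
        # input video -- the "video-to-video structure" in the abstract.
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, noisy_target, source, t):
        # t: noise level in [0, 1], one scalar per frame token.
        x = torch.cat([noisy_target, source, t], dim=-1)
        return self.net(x)  # predicts the added noise (epsilon)

def paired_v2v_loss(model, human_latents, humanoid_latents):
    """Epsilon-prediction loss on paired (human, humanoid) clips,
    mirroring the paired finetuning objective the abstract describes."""
    noise = torch.randn_like(humanoid_latents)
    t = torch.rand(humanoid_latents.shape[:-1] + (1,))
    # Simple linear noising schedule, for illustration only.
    noisy = (1 - t) * humanoid_latents + t * noise
    pred = model(noisy, human_latents, t)
    return nn.functional.mse_loss(pred, noise)

if __name__ == "__main__":
    # Fake paired batch: (batch, frame tokens, latent_dim). Real pairs
    # would come from the Unreal Engine-rendered synthetic corpus.
    human = torch.randn(4, 128, 64)
    humanoid = torch.randn(4, 128, 64)
    model = ToyV2VDenoiser()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for step in range(3):
        loss = paired_v2v_loss(model, human, humanoid)
        opt.zero_grad()
        loss.backward()
        opt.step()
        print(f"step {step}: loss {loss.item():.4f}")
```

At inference time, the same conditioning pattern would be run in reverse: latents of a real third-person human video (e.g., from Ego-Exo4D) serve as the condition while the humanoid clip is denoised from noise, which is how a trained model of this kind can "robotize" unpaired footage at scale.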