

X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

December 4, 2025
Authors: Pei Yang, Hai Ci, Yiren Song, Mike Zheng Shou
cs.AI

Abstract

The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, which has proven effective for policy training. However, existing solutions mainly "overlay" robot arms onto egocentric videos; they cannot handle the complex full-body motions and scene occlusions of third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we design a scalable data creation pipeline that turns community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.
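The batch "robotization" step described above can be sketched as a clip-by-clip application of a finetuned video-to-video model to source human videos, with the frame count preserved end to end. This is a minimal hypothetical sketch; the names (`VideoClip`, `robotize_clip`) are illustrative and not the paper's actual API, and the model call is stubbed out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VideoClip:
    """A source clip; frames would be raw image buffers in practice."""
    name: str
    frames: List[bytes]


def robotize_clip(clip: VideoClip) -> VideoClip:
    """Stand-in for the finetuned video-to-video model: maps each
    human-video frame to a humanoid-video frame of the same length."""
    # A real system would run diffusion-based editing over the clip;
    # here we just tag each frame to keep the sketch runnable.
    edited = [b"humanoid:" + f for f in clip.frames]
    return VideoClip(name=clip.name + "_robotized", frames=edited)


def robotize_dataset(clips: List[VideoClip]) -> List[VideoClip]:
    # Frame counts are preserved, which is how 60 hours of source
    # video yields the reported 3.6M+ output frames.
    return [robotize_clip(c) for c in clips]


demo = [VideoClip("ego_exo4d_000", [b"f0", b"f1", b"f2"])]
out = robotize_dataset(demo)
assert len(out[0].frames) == len(demo[0].frames)
```

One design point worth noting from the abstract: because the model is trained on paired synthetic data (Unreal Engine renders of the same motion in human and humanoid form), the per-clip translation can preserve full-body motion and scene occlusions rather than compositing a robot arm over the frame.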