HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Abstract

Embodied foundation models are expected to benefit from data scaling as large language models have, but they face a far more severe data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source because they offer precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, difficulty of acquisition, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. How effective it is relative to teleoperated real-robot data, however, remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, once processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for pretraining but can yield superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, and 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding validates a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data to align the action space. We hope this study encourages broader exploration of egocentric data and offers guidance for assessing data quality before costly robot data collection.