HumanNet: Scaling Human-centric Video Learning to One Million Hours

May 7, 2026
Authors: Yufan Deng, Daquan Zhou
cs.AI

Abstract

Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, learning physical interaction remains constrained by the lack of large, diverse, and richly annotated human activity data. We present HumanNet, a one-million-hour human-centric video corpus that captures how humans interact with the physical world at scale. HumanNet spans both first-person and third-person perspectives and covers fine-grained activities, human-object interactions, tool use, and long-horizon behaviors across diverse real-world environments. Beyond raw video, the dataset provides interaction-centric annotations, including captions, motion descriptions, and hand- and body-related signals, enabling motion-aware and interaction-aware learning. Beyond scale, HumanNet introduces a systematic data curation paradigm for embodied learning, in which human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment are treated as first-class design principles. This design transforms unstructured internet video into a scalable substrate for representation learning, activity understanding, motion generation, and human-to-robot transfer. We conduct a first-step validation of the value of this design through a controlled vision-language-action ablation: on a fixed validation set, continued training of the Qwen VLM with 1000 hours of egocentric video drawn from HumanNet surpasses continued training with 100 hours of real-robot data from Magic Cobot, indicating that egocentric human video could be a scalable and cost-effective substitute for robot data. By building this project, we aim to explore the opportunity to scale embodied foundation models using human-centric videos, rather than relying solely on robot-specific data.
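
To make the curation paradigm concrete, the sketch below shows one way the described human-centric filtering and annotation enrichment could be expressed in code. The `ClipAnnotation` record, the `passes_curation` filter, and all field names and thresholds are hypothetical illustrations under assumed conventions, not the paper's actual schema or pipeline.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical per-clip annotation record; field names are illustrative,
# not the actual HumanNet schema.
@dataclass
class ClipAnnotation:
    clip_id: str
    viewpoint: str                    # "egocentric" or "third_person" (viewpoint diversity)
    start_s: float                    # temporal structuring: segment start, in seconds
    end_s: float                      # temporal structuring: segment end, in seconds
    caption: str                      # free-form scene caption
    motion_description: str           # textual description of the person's motion
    hand_visible_ratio: float         # fraction of frames with visible hands
    person_visible_ratio: float       # fraction of frames with a visible person
    tags: List[str] = field(default_factory=list)  # e.g. tool use, object interaction

def passes_curation(clip: ClipAnnotation,
                    min_person_ratio: float = 0.8,
                    min_duration_s: float = 2.0) -> bool:
    """Illustrative human-centric filter: keep clips that clearly show a person,
    span a usable temporal segment, and carry both caption and motion annotations."""
    duration = clip.end_s - clip.start_s
    return (
        clip.person_visible_ratio >= min_person_ratio
        and duration >= min_duration_s
        and bool(clip.caption.strip())
        and bool(clip.motion_description.strip())
        and clip.viewpoint in {"egocentric", "third_person"}
    )

# Example: a clip that would pass this illustrative filter.
example = ClipAnnotation(
    clip_id="clip_000001",
    viewpoint="egocentric",
    start_s=12.0,
    end_s=18.5,
    caption="A person chops vegetables at a kitchen counter.",
    motion_description="Right hand grips a knife and performs repeated slicing motions.",
    hand_visible_ratio=0.95,
    person_visible_ratio=1.0,
    tags=["tool_use", "human_object_interaction"],
)
print(passes_curation(example))  # True
```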