HumanNet: Scaling Human-Centric Video Learning to One Million Hours

Abstract

Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, learning physical interaction remains constrained by the lack of large, diverse, and richly annotated human activity data. We present HumanNet, a one-million-hour human-centric video corpus that captures how humans interact with the physical world. HumanNet spans both first-person and third-person perspectives and covers fine-grained activities, human-object interactions, tool use, and long-horizon behaviors across diverse real-world environments. In addition to raw video, the dataset provides interaction-centric annotations, including captions, motion descriptions, and hand- and body-related signals, enabling motion-aware and interaction-aware learning. Beyond scale, HumanNet introduces a systematic data curation paradigm for embodied learning in which human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment are treated as first-class design principles. This design transforms unstructured internet video into a scalable substrate for representation learning, activity understanding, motion generation, and human-to-robot transfer. As a first validation of this design, we run a controlled vision-language-action ablation: evaluated on a fixed validation set, continued training of the Qwen VLM on 1000 hours of egocentric video drawn from HumanNet outperforms continued training on 100 hours of real-robot data from Magic Cobot, indicating that egocentric human video could be a scalable and cost-effective substitute for robot data. With this project, we aim to explore the opportunity to scale embodied foundation models using human-centric video rather than relying solely on robot-specific data.