HumanNet: Scaling Human-centric Video Learning to One Million Hours
May 7, 2026
Authors: Yufan Deng, Daquan Zhou
cs.AI
Abstract
Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, learning physical interaction remains constrained by the lack of large, diverse, and richly annotated human activity data. We present HumanNet, a one-million-hour human-centric video corpus that captures how humans interact with the physical world at scale. HumanNet spans both first-person and third-person perspectives and covers fine-grained activities, human-object interactions, tool use, and long-horizon behaviors across diverse real-world environments. Beyond raw video, the dataset provides interaction-centric annotations, including captions, motion descriptions, and hand- and body-related signals, enabling motion-aware and interaction-aware learning. Beyond scale, HumanNet introduces a systematic data curation paradigm for embodied learning, in which human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment are treated as first-class design principles. This design transforms unstructured internet video into a scalable substrate for representation learning, activity understanding, motion generation, and human-to-robot transfer. We conduct a first-step validation of the value of this design through a controlled vision-language-action ablation: on a fixed validation set, continued training of the Qwen VLM on 1,000 hours of egocentric video drawn from HumanNet surpasses continued training on 100 hours of real-robot data from Magic Cobot, indicating that egocentric human video could be a scalable and cost-effective substitute for robot data. With this project, we aim to explore the opportunity to scale embodied foundation models using human-centric videos rather than relying solely on robot-specific data.
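The abstract does not specify a schema for the interaction-centric annotations or the curation rules. The sketch below is a minimal, hypothetical illustration of what a per-clip annotation record and a human-centric filtering pass could look like; all field names, signal formats, and thresholds are assumptions, not the actual HumanNet design.

```python
from dataclasses import dataclass, field
from typing import List, Literal

# Hypothetical per-clip annotation record; field names are illustrative,
# not the published HumanNet schema.
@dataclass
class ClipAnnotation:
    clip_id: str
    viewpoint: Literal["egocentric", "exocentric"]   # first- vs third-person
    start_s: float                                   # clip boundaries from temporal structuring
    end_s: float
    caption: str                                     # free-text description of the activity
    motion_description: str                          # coarse description of body/hand motion
    hand_keypoints: List[List[float]] = field(default_factory=list)  # per-frame hand signals
    body_keypoints: List[List[float]] = field(default_factory=list)  # per-frame body signals
    human_visibility: float = 0.0                    # fraction of frames with a visible person


def passes_human_centric_filter(clip: ClipAnnotation,
                                min_visibility: float = 0.8,
                                min_duration_s: float = 2.0) -> bool:
    """Assumed filtering rule: keep clips in which a person is visible most of
    the time and which are long enough to contain an interaction."""
    duration = clip.end_s - clip.start_s
    return clip.human_visibility >= min_visibility and duration >= min_duration_s


# Example usage with dummy values.
clip = ClipAnnotation(
    clip_id="web_000001_0034",
    viewpoint="egocentric",
    start_s=12.0,
    end_s=19.5,
    caption="A person tightens a bolt on a bicycle wheel with a wrench.",
    motion_description="Right hand grips the wrench and rotates it clockwise.",
    human_visibility=0.95,
)
print(passes_human_centric_filter(clip))  # True under the assumed thresholds
```

A record of this kind would let the same corpus serve both motion-aware objectives (via the keypoint signals) and language-grounded objectives (via captions and motion descriptions), which is how the abstract frames the dataset's intended use.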