HumanNet: Scaling Human-centric Video Learning to One Million Hours

Abstract

Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, learning physical interaction remains constrained by the lack of large, diverse, and richly annotated human activity data. We present HumanNet, a one-million-hour human-centric video corpus that captures how humans interact with the physical world at scale. HumanNet spans both first-person and third-person perspectives and covers fine-grained activities, human-object interactions, tool use, and long-horizon behaviors across diverse real-world environments. In addition to raw video, the dataset provides interaction-centric annotations, including captions, motion descriptions, and hand- and body-related signals, enabling motion-aware and interaction-aware learning. Beyond scale, HumanNet introduces a systematic data curation paradigm for embodied learning in which human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment are treated as first-class design principles. This design transforms unstructured internet video into a scalable substrate for representation learning, activity understanding, motion generation, and human-to-robot transfer. We conduct a first-step validation of the value of this design through a controlled vision-language-action ablation: under a fixed set of validation data, continued training of the Qwen VLM with 1000 hours of egocentric video drawn from HumanNet outperforms continued training with 100 hours of real-robot data from Magic Cobot, indicating that egocentric human video could be a scalable and cost-effective substitute for robot data. With this project, we aim to explore the opportunity to scale embodied foundation models using human-centric videos rather than relying solely on robot-specific data.
