HumanScale: Egocentric Human Video Can Outperform Real-Robot Data in Embodied Pretraining

Abstract

Embodied foundation models are expected to benefit from data scaling as large language models do, but they face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source because of their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding supports a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.