HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Abstract

Embodied foundation models are expected to benefit from data scaling as large language models have, but they face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source because of their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially cheaper, and more diverse alternative for embodied model pretraining. However, its effectiveness relative to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for real-robot pretraining data but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding supports a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for assessing data quality before costly robot data collection.