Is Diversity All You Need for Scalable Robotic Manipulation?

Abstract

Data scaling has driven remarkable success in foundation models for Natural Language Processing (NLP) and Computer Vision (CV), yet the principles of effective data scaling in robotic manipulation remain insufficiently understood. In this work, we investigate the nuanced role of data diversity in robot learning by examining three critical dimensions: task (what to do), embodiment (which robot to use), and expert (who demonstrates). Our findings challenge the conventional intuition that "more diverse is better". Through extensive experiments on various robot platforms, we reveal that (1) task diversity is more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios; (2) multi-embodiment pre-training data is optional for cross-embodiment transfer: models trained on high-quality single-embodiment data can transfer efficiently to different platforms and show more desirable scaling properties during fine-tuning than multi-embodiment pre-trained models; and (3) expert diversity, arising from individual operational preferences and stochastic variations in human demonstrations, can confound policy learning, with velocity multimodality emerging as a key contributing factor. Based on this insight, we propose a distribution debiasing method to mitigate velocity ambiguity. The resulting GO-1-Pro achieves a substantial performance gain of 15%, equivalent to using 2.5 times as much pre-training data. Collectively, these findings provide new perspectives and offer practical guidance on how to scale robotic manipulation datasets effectively.