Is Diversity All You Need for Scalable Robotic Manipulation?

Abstract

Data scaling has driven remarkable success in foundation models for Natural Language Processing (NLP) and Computer Vision (CV), yet the principles of effective data scaling in robotic manipulation remain insufficiently understood. In this work, we investigate the nuanced role of data diversity in robot learning by examining three critical dimensions: task (what to do), embodiment (which robot to use), and expert (who demonstrates). In doing so, we challenge the conventional intuition that "more diverse is better". Through extensive experiments on various robot platforms, we reveal that (1) task diversity is more critical than per-task demonstration quantity, and it benefits transfer from diverse pre-training tasks to novel downstream scenarios; (2) multi-embodiment pre-training data is optional for cross-embodiment transfer: models trained on high-quality single-embodiment data can transfer efficiently to different platforms and show more desirable scaling properties during fine-tuning than multi-embodiment pre-trained models; and (3) expert diversity, arising from individual operational preferences and stochastic variations in human demonstrations, can be confounding to policy learning, with velocity multimodality emerging as a key contributing factor. Based on this insight, we propose a distribution debiasing method to mitigate velocity ambiguity. The resulting GO-1-Pro achieves a substantial performance gain of 15%, equivalent to using 2.5 times the pre-training data. Collectively, these findings provide new perspectives and offer practical guidance on how to scale robotic manipulation datasets effectively.