GR-3 Technical Report

Abstract

We report our recent progress toward building generalist robot policies: the development of GR-3. GR-3 is a large-scale vision-language-action (VLA) model that shows exceptional generalization to novel objects, environments, and instructions involving abstract concepts. It can also be fine-tuned efficiently with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 further excels at long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, with robust and reliable performance. These capabilities are achieved through a multi-faceted training recipe that combines co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed for exceptional flexibility and reliability, which can accomplish a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show that GR-3 surpasses the state-of-the-art baseline method, pi_0, on a wide variety of challenging tasks. We hope GR-3 can serve as a step toward building generalist robots capable of assisting humans in daily life.