GR-3 Technical Report

Abstract

We report our recent progress towards building generalist robot policies: the development of GR-3. GR-3 is a large-scale vision-language-action (VLA) model. It showcases exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. Furthermore, it can be efficiently fine-tuned with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 also excels at long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, and performs them robustly and reliably. These capabilities are achieved through a multi-faceted training recipe that includes co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed with exceptional flexibility and reliability, capable of accomplishing a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show that GR-3 surpasses the state-of-the-art baseline method, pi_0, on a wide variety of challenging tasks. We hope GR-3 can serve as a step towards building generalist robots capable of assisting humans in daily life.
