Robot Learning with Sensorimotor Pre-training

Abstract

We present a self-supervised sensorimotor pre-training approach for robotics. Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens. Given a sequence of camera images, proprioceptive robot states, and past actions, we encode the interleaved sequence into tokens, mask out a random subset, and train a model to predict the masked-out content. We hypothesize that if the robot can predict the missing content, it has acquired a good model of the physical world that can enable it to act. RPT is designed to operate on latent visual representations, which makes prediction tractable, enables scaling to 10x larger models, and allows 10 Hz inference on a real robot. To evaluate our approach, we collect a dataset of 20,000 real-world trajectories over 9 months using a combination of motion planning and model-based grasping algorithms. We find that pre-training on this data consistently outperforms training from scratch, leads to 2x improvements in the block stacking task, and has favorable scaling properties.
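The masking objective described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the helper names (`interleave`, `mask_tokens`, `masked_mse`), the use of `None` as a stand-in for a learned mask token, the mask ratio, and the choice of mean squared error as the reconstruction loss. The "model" is a placeholder that predicts zeros; in RPT it would be a Transformer.

```python
import random

MASK = None  # hypothetical placeholder standing in for a learned mask token


def interleave(images, states, actions):
    """Flatten per-timestep (image, state, action) latents into one token sequence."""
    tokens = []
    for img, state, act in zip(images, states, actions):
        tokens.extend([("image", img), ("state", state), ("action", act)])
    return tokens


def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of tokens; return the masked sequence and hidden indices."""
    n_masked = max(1, round(mask_ratio * len(tokens)))
    hidden = sorted(rng.sample(range(len(tokens)), n_masked))
    hidden_set = set(hidden)
    masked = [(kind, MASK) if i in hidden_set else (kind, vec)
              for i, (kind, vec) in enumerate(tokens)]
    return masked, hidden


def masked_mse(predictions, tokens, hidden):
    """Mean squared error computed only over the hidden positions (assumed loss)."""
    total, count = 0.0, 0
    for i in hidden:
        target = tokens[i][1]
        for p, t in zip(predictions[i], target):
            total += (p - t) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    rng = random.Random(0)
    steps, dim = 4, 3  # illustrative sizes only

    def random_latents():
        return [[rng.random() for _ in range(dim)] for _ in range(steps)]

    tokens = interleave(random_latents(), random_latents(), random_latents())
    masked, hidden = mask_tokens(tokens, mask_ratio=0.5, rng=rng)

    # Stand-in "model": predicts a zero vector for every position.
    predictions = [[0.0] * dim for _ in masked]
    print("hidden positions:", hidden)
    print("masked-prediction loss:", masked_mse(predictions, tokens, hidden))
```

Restricting the loss to the hidden positions is what makes this a masked-prediction objective: the model gets no credit for copying tokens it can already see.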
