Robot Learning with Sensorimotor Pre-training

Abstract

We present a self-supervised sensorimotor pre-training approach for robotics. Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens. Given a sequence of camera images, proprioceptive robot states, and past actions, we encode the interleaved sequence into tokens, mask out a random subset, and train the model to predict the masked-out content. We hypothesize that if the robot can predict the missing content, it has acquired a good model of the physical world that can enable it to act. RPT is designed to operate on latent visual representations, which makes prediction tractable and enables both scaling to 10x larger models and 10 Hz inference on a real robot. To evaluate our approach, we collect a dataset of 20,000 real-world trajectories over 9 months using a combination of motion planning and model-based grasping algorithms. We find that pre-training on this data consistently outperforms training from scratch, leads to 2x improvements in the block stacking task, and has favorable scaling properties.
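
The masked-prediction objective described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes every modality has already been encoded into token vectors of a common dimension, it uses a mean-squared-error loss restricted to the masked positions, and it leaves out the Transformer itself. The function names, the mask ratio, and the mask token are all hypothetical.

```python
import random


def interleave(images, proprio, actions):
    """Interleave per-timestep tokens as [img_0, prop_0, act_0, img_1, ...]."""
    assert len(images) == len(proprio) == len(actions)
    tokens = []
    for img, prop, act in zip(images, proprio, actions):
        tokens.extend([img, prop, act])
    return tokens


def random_mask(num_tokens, mask_ratio, rng):
    """Return sorted indices of a random subset of token positions to mask."""
    # mask_ratio is an assumed hyperparameter; at least one token is masked.
    num_masked = max(1, round(mask_ratio * num_tokens))
    return sorted(rng.sample(range(num_tokens), num_masked))


def apply_mask(tokens, masked_idx, mask_token):
    """Replace the tokens at the masked positions with a placeholder vector."""
    masked = set(masked_idx)
    return [mask_token if i in masked else tok for i, tok in enumerate(tokens)]


def masked_mse(predictions, targets, masked_idx):
    """Mean squared error computed only over the masked positions."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(predictions[i], targets[i]):
            total += (p - t) ** 2
            count += 1
    return total / count
```

In a training loop, the output of `apply_mask` would be fed to the model, and `masked_mse` would compare the model's predictions against the original tokens at the masked positions only. Whether the actual method uses this loss or this masking schedule is not stated in the abstract.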
