Robot Learning with Sensorimotor Pre-training

June 16, 2023
Authors: Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell, Jitendra Malik
cs.AI

Abstract

We present a self-supervised sensorimotor pre-training approach for robotics. Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens. Given a sequence of camera images, proprioceptive robot states, and past actions, we encode the interleaved sequence into tokens, mask out a random subset, and train a model to predict the masked-out content. We hypothesize that if the robot can predict the missing content, it has acquired a good model of the physical world that can enable it to act. RPT is designed to operate on latent visual representations, which makes prediction tractable, enables scaling to 10x larger models, and 10 Hz inference on a real robot. To evaluate our approach, we collect a dataset of 20,000 real-world trajectories over 9 months using a combination of motion planning and model-based grasping algorithms. We find that pre-training on this data consistently outperforms training from scratch, leads to 2x improvements in the block stacking task, and has favorable scaling properties.
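
To make the masked-prediction recipe described in the abstract concrete, below is a minimal PyTorch sketch: latent visual features, proprioceptive states, and actions are projected into a shared token space, interleaved per time step, a random subset of tokens is replaced by a learned mask token, and a Transformer encoder is trained to reconstruct the masked content. All module names, dimensions, the mask ratio, and the loss weighting are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of masked sensorimotor prediction in the spirit of RPT.
# Dimensions, masking scheme, and loss weighting are illustrative assumptions.
import torch
import torch.nn as nn


class MaskedSensorimotorTransformer(nn.Module):
    def __init__(self, latent_dim=32, proprio_dim=7, action_dim=7,
                 d_model=256, n_layers=4, n_heads=8, max_steps=64):
        super().__init__()
        # Project each modality into a shared token space.
        self.vision_proj = nn.Linear(latent_dim, d_model)
        self.proprio_proj = nn.Linear(proprio_dim, d_model)
        self.action_proj = nn.Linear(action_dim, d_model)
        # Learned embedding that replaces masked-out tokens, plus positions.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, 3 * max_steps, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Per-modality heads that reconstruct the masked content.
        self.vision_head = nn.Linear(d_model, latent_dim)
        self.proprio_head = nn.Linear(d_model, proprio_dim)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, vision_latents, proprio, actions, mask_ratio=0.5):
        # Inputs are (batch, time, dim); interleave per time step as
        # [vision_t, proprio_t, action_t, vision_{t+1}, ...].
        B, T, _ = vision_latents.shape
        tokens = torch.stack([
            self.vision_proj(vision_latents),
            self.proprio_proj(proprio),
            self.action_proj(actions),
        ], dim=2).reshape(B, 3 * T, -1)
        tokens = tokens + self.pos_embed[:, : 3 * T]

        # Randomly mask a subset of tokens and replace them with the mask token.
        mask = torch.rand(B, 3 * T, device=tokens.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand_as(tokens), tokens)

        out = self.encoder(tokens).reshape(B, T, 3, -1)
        mask = mask.reshape(B, T, 3)

        # Reconstruct each modality; the loss counts only masked positions.
        preds = [self.vision_head(out[:, :, 0]),
                 self.proprio_head(out[:, :, 1]),
                 self.action_head(out[:, :, 2])]
        targets = [vision_latents, proprio, actions]
        loss = 0.0
        for i, (pred, target) in enumerate(zip(preds, targets)):
            m = mask[:, :, i].unsqueeze(-1).float()
            loss = loss + ((pred - target) ** 2 * m).sum() / m.sum().clamp(min=1)
        return loss


if __name__ == "__main__":
    model = MaskedSensorimotorTransformer()
    vision = torch.randn(2, 16, 32)   # stand-in for latent visual features
    proprio = torch.randn(2, 16, 7)   # proprioceptive robot states
    actions = torch.randn(2, 16, 7)   # past actions
    loss = model(vision, proprio, actions)
    loss.backward()
    print(f"masked reconstruction loss: {loss.item():.4f}")
```

The abstract notes that RPT operates on latent visual representations to keep prediction tractable; in this sketch those latents are random placeholder tensors, and in practice they would come from a separate visual encoder.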