RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation

Abstract

The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a foundation agent for robotic manipulation. This agent, named RoboCat, is a visual goal-conditioned decision transformer capable of consuming multi-embodiment action-labelled visual experience. This data spans a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions. With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot and through adaptation using only 100-1000 examples for the target task. We also show how a trained model itself can be used to generate data for subsequent training iterations, thus providing a basic building block for an autonomous improvement loop. We investigate the agent's capabilities with large-scale evaluations both in simulation and on three different real robot embodiments. We find that as we grow and diversify its training data, RoboCat not only shows signs of cross-task transfer, but also becomes more efficient at adapting to new tasks.
