Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset

Abstract

Pre-training visual representations has improved the efficiency of robot learning. Because large-scale in-domain robotic datasets are scarce, prior work has used in-the-wild human videos to pre-train robotic visual representations. Despite promising results, representations learned from human videos are inevitably subject to distribution shift and lack the dynamics information crucial for task completion. We first evaluate various pre-trained representations in terms of their correlation to downstream robotic manipulation tasks (i.e., manipulation centricity). Interestingly, we find that manipulation centricity is a strong indicator of the success rate a representation achieves on downstream tasks. Drawing on these findings, we propose Manipulation Centric Representation (MCR), a foundation representation learning framework that captures both visual features and the dynamics information of manipulation tasks, such as actions and proprioception, to improve manipulation centricity. Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions. We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, and combine it with a behavior cloning (BC)-like actor loss that predicts actions during pre-training and with a time contrastive loss. Empirical results across 4 simulation domains with 20 tasks show that MCR outperforms the strongest baseline method by 14.8%. Moreover, MCR boosts the performance of data-efficient learning with a UR5e arm on 3 real-world tasks by 76.9%. Project website: https://robots-pretrain-robots.github.io/.
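To make the loss combination described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes an InfoNCE-style form for the contrastive term that aligns visual embeddings with state-action dynamics embeddings, and a mean-squared-error form for the BC-like actor loss. The function names, the temperature, and the weights `w_dyn` and `w_bc` are hypothetical, and the time contrastive term is omitted for brevity.

```python
import numpy as np


def info_nce(visual, dynamics, temperature=0.1):
    """InfoNCE-style loss: row i of `visual` is the positive for row i of `dynamics`.

    Assumed form for illustration only; the abstract does not specify the loss.
    """
    # L2-normalize so the dot product is a cosine similarity.
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    d = dynamics / np.linalg.norm(dynamics, axis=1, keepdims=True)
    logits = v @ d.T / temperature
    # Subtract the row-wise max for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


def bc_loss(predicted_actions, expert_actions):
    """BC-like actor loss, assumed here to be mean squared error."""
    return float(np.mean((predicted_actions - expert_actions) ** 2))


def combined_loss(visual, dynamics, predicted_actions, expert_actions,
                  w_dyn=1.0, w_bc=1.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return (w_dyn * info_nce(visual, dynamics)
            + w_bc * bc_loss(predicted_actions, expert_actions))
```

In this sketch, a batch whose visual and dynamics embeddings are correctly paired yields a lower contrastive loss than the same batch with the pairing shuffled, which is the behavior the alignment objective is meant to encourage.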
