Robotic Offline RL from Internet Videos via Value-Function Pre-Training

Abstract

Pre-training on Internet data has proven to be a key ingredient for broad generalization in many modern ML systems. What would it take to enable such capabilities in robotic reinforcement learning (RL)? Offline RL methods, which learn from datasets of robot experience, offer one way to incorporate prior data into the robotic learning pipeline. However, these methods have a "type mismatch" with video data (such as Ego4D), the largest prior datasets available for robotics, because video offers observation-only experience without the action or reward annotations that RL methods need. In this paper, we develop a system for leveraging large-scale human video datasets in robotic offline RL, based entirely on learning value functions via temporal-difference learning. We show that value learning on video datasets learns representations that are more conducive to downstream robotic offline RL than other approaches for learning from video data. Our system, called V-PTR, combines the benefits of pre-training on video data with robotic offline RL approaches that train on diverse robot data, resulting in value functions and policies for manipulation tasks that perform better, act robustly, and generalize broadly. On several manipulation tasks on a real WidowX robot, our framework produces policies that greatly improve over prior methods. Our video and additional details can be found at https://dibyaghosh.com/vptr/
