Robotic Offline RL from Internet Videos via Value-Function Pre-Training

Abstract

Pre-training on Internet data has proven to be a key ingredient for broad generalization in many modern ML systems. What would it take to enable such capabilities in robotic reinforcement learning (RL)? Offline RL methods, which learn from datasets of robot experience, offer one way to incorporate prior data into the robotic learning pipeline. However, these methods have a "type mismatch" with video data (such as Ego4D), the largest prior datasets available for robotics, since video offers observation-only experience without the action or reward annotations needed for RL methods. In this paper, we develop a system for leveraging large-scale human video datasets in robotic offline RL, based entirely on learning value functions via temporal-difference learning. We show that value learning on video datasets learns representations that are more conducive to downstream robotic offline RL than other approaches for learning from video data. Our system, called V-PTR, combines the benefits of pre-training on video data with robotic offline RL approaches that train on diverse robot data, resulting in value functions and policies for manipulation tasks that perform better, act robustly, and generalize broadly. On several manipulation tasks on a real WidowX robot, our framework produces policies that greatly improve over prior methods. Our video and additional details can be found at https://dibyaghosh.com/vptr/.
