Residual Off-Policy RL for Finetuning Behavior Cloning Policies
September 23, 2025
Authors: Lars Ankile, Zhenyu Jiang, Rocky Duan, Guanya Shi, Pieter Abbeel, Anusha Nagabandi
cs.AI
Abstract
Recent advances in behavior cloning (BC) have enabled impressive visuomotor control policies. However, these approaches are limited by the quality of human demonstrations, the manual effort required for data collection, and the diminishing returns of adding more offline data. In contrast, reinforcement learning (RL) trains an agent through autonomous interaction with the environment and has shown remarkable success in various domains. Still, training RL policies directly on real-world robots remains challenging due to sample inefficiency, safety concerns, and the difficulty of learning from sparse rewards on long-horizon tasks, especially for high-degree-of-freedom (DoF) systems. We present a recipe that combines the benefits of BC and RL through a residual learning framework. Our approach leverages BC policies as black-box bases and learns lightweight per-step residual corrections via sample-efficient off-policy RL. We show that our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-DoF systems in both simulation and the real world. In particular, we report, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands. Our results demonstrate state-of-the-art performance across various vision-based tasks, pointing towards a practical pathway for deploying RL in the real world.
Project website: https://residual-offpolicy-rl.github.io
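
Below is a minimal, hypothetical sketch of the per-step residual composition the abstract describes: a frozen, black-box BC policy proposes a base action, and a lightweight residual policy trained with off-policy RL adds a small correction before the action is executed. The names bc_policy, residual_policy, and residual_scale are illustrative assumptions, not the authors' actual API.

import numpy as np

def act(obs, bc_policy, residual_policy, residual_scale=0.1):
    """Compose the executed action from a frozen BC base and a learned residual correction."""
    base_action = bc_policy(obs)                      # black-box BC policy output
    correction = residual_policy(obs, base_action)    # lightweight learned per-step correction
    return base_action + residual_scale * correction  # action sent to the robot

# Toy usage with placeholder policies standing in for real networks.
if __name__ == "__main__":
    bc = lambda obs: np.zeros(7)                      # stand-in 7-DoF base action
    res = lambda obs, a: 0.1 * np.ones_like(a)        # stand-in residual network
    print(act(np.zeros(4), bc, res))

Scaling the residual keeps the executed action close to the BC base, which is one common way such residual schemes limit how far exploration can deviate from the demonstrated behavior.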