Residual Off-Policy RL for Finetuning Behavior Cloning Policies
September 23, 2025
Authors: Lars Ankile, Zhenyu Jiang, Rocky Duan, Guanya Shi, Pieter Abbeel, Anusha Nagabandi
cs.AI
Abstract
Recent advances in behavior cloning (BC) have enabled impressive visuomotor
control policies. However, these approaches are limited by the quality of human
demonstrations, the manual effort required for data collection, and the
diminishing returns from increasing offline data. In comparison, reinforcement
learning (RL) trains an agent through autonomous interaction with the
environment and has shown remarkable success in various domains. Still,
training RL policies directly on real-world robots remains challenging due to
sample inefficiency, safety concerns, and the difficulty of learning from
sparse rewards for long-horizon tasks, especially for high-degree-of-freedom
(DoF) systems. We present a recipe that combines the benefits of BC and RL
through a residual learning framework. Our approach leverages BC policies as
black-box bases and learns lightweight per-step residual corrections via
sample-efficient off-policy RL. We demonstrate that our method requires only
sparse binary reward signals and can effectively improve manipulation policies
on high-DoF systems in both simulation and the real world.
In particular, we demonstrate, to the best of our knowledge, the first
successful real-world RL training on a humanoid robot with dexterous hands. Our
results show state-of-the-art performance across various vision-based tasks,
pointing towards a practical pathway for deploying RL in the real world.
Project website: https://residual-offpolicy-rl.github.io
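
The abstract describes the core mechanism: a frozen, black-box BC policy proposes an action at each step, and a lightweight residual network, trained with sample-efficient off-policy RL from sparse binary rewards, outputs a bounded correction that is added to that action. The PyTorch sketch below illustrates this composition only; the class and function names (ResidualPolicy, base_policy, act), the network architecture, and the correction bound max_correction are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """Lightweight per-step corrector on top of a frozen black-box BC policy (sketch)."""

    def __init__(self, obs_dim: int, action_dim: int,
                 hidden_dim: int = 256, max_correction: float = 0.05):
        super().__init__()
        # Residual is conditioned on the observation and the base policy's action.
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim), nn.Tanh(),
        )
        # Bound the correction magnitude (assumed value) to keep behavior
        # close to the BC base, which matters for real-robot safety.
        self.max_correction = max_correction

    def forward(self, obs: torch.Tensor, base_action: torch.Tensor) -> torch.Tensor:
        delta = self.net(torch.cat([obs, base_action], dim=-1))
        return self.max_correction * delta


@torch.no_grad()
def act(base_policy: nn.Module, residual: ResidualPolicy,
        obs: torch.Tensor) -> torch.Tensor:
    """Executed action = black-box BC action + learned residual correction."""
    base_action = base_policy(obs)  # BC policy queried as a black box, never fine-tuned
    return base_action + residual(obs, base_action)
```

In the setting the abstract describes, the residual network would be the only component updated, optimized with a sample-efficient off-policy RL algorithm using only the sparse binary task reward, while the BC base policy remains fixed.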