Residual Off-Policy RL for Finetuning Behavior Cloning Policies
September 23, 2025
Authors: Lars Ankile, Zhenyu Jiang, Rocky Duan, Guanya Shi, Pieter Abbeel, Anusha Nagabandi
cs.AI
Abstract
Recent advances in behavior cloning (BC) have enabled impressive visuomotor
control policies. However, these approaches are limited by the quality of human
demonstrations, the manual effort required for data collection, and the
diminishing returns from increasing offline data. In comparison, reinforcement
learning (RL) trains an agent through autonomous interaction with the
environment and has shown remarkable success in various domains. Still,
training RL policies directly on real-world robots remains challenging due to
sample inefficiency, safety concerns, and the difficulty of learning from
sparse rewards for long-horizon tasks, especially for high-degree-of-freedom
(DoF) systems. We present a recipe that combines the benefits of BC and RL
through a residual learning framework. Our approach leverages BC policies as
black-box bases and learns lightweight per-step residual corrections via
sample-efficient off-policy RL. We demonstrate that our method requires only
sparse binary reward signals and can effectively improve manipulation policies
on high-DoF systems in both simulation and the real world.
In particular, we demonstrate, to the best of our knowledge, the first
successful real-world RL training on a humanoid robot with dexterous hands. Our
results show state-of-the-art performance across various vision-based tasks,
pointing towards a practical pathway for deploying RL in the real world.
Project website: https://residual-offpolicy-rl.github.io
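
The abstract describes the core mechanism: a frozen, black-box BC policy proposes an action at each step, and a lightweight residual network, trained with sample-efficient off-policy RL from sparse binary rewards, outputs a bounded correction that is added to that action. The PyTorch sketch below illustrates this composition only; the class and function names (ResidualPolicy, base_policy, act), the network architecture, and the correction bound max_correction are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """Lightweight per-step corrector on top of a frozen black-box BC policy (sketch)."""

    def __init__(self, obs_dim: int, action_dim: int,
                 hidden_dim: int = 256, max_correction: float = 0.05):
        super().__init__()
        # Residual is conditioned on the observation and the base policy's action.
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim), nn.Tanh(),
        )
        # Bound the correction magnitude (assumed value) to keep behavior
        # close to the BC base, which matters for real-robot safety.
        self.max_correction = max_correction

    def forward(self, obs: torch.Tensor, base_action: torch.Tensor) -> torch.Tensor:
        delta = self.net(torch.cat([obs, base_action], dim=-1))
        return self.max_correction * delta


@torch.no_grad()
def act(base_policy: nn.Module, residual: ResidualPolicy,
        obs: torch.Tensor) -> torch.Tensor:
    """Executed action = black-box BC action + learned residual correction."""
    base_action = base_policy(obs)  # BC policy queried as a black box, never fine-tuned
    return base_action + residual(obs, base_action)
```

In the setting the abstract describes, the residual network would be the only component updated, optimized with a sample-efficient off-policy RL algorithm using only the sparse binary task reward, while the BC base policy remains fixed.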