Residual Off-Policy RL for Finetuning Behavior Cloning Policies

Abstract

Recent advances in behavior cloning (BC) have enabled impressive visuomotor control policies. However, these approaches are limited by the quality of human demonstrations, the manual effort required for data collection, and the diminishing returns from additional offline data. By contrast, reinforcement learning (RL) trains an agent through autonomous interaction with the environment and has shown remarkable success in various domains. Still, training RL policies directly on real-world robots remains challenging because of sample inefficiency, safety concerns, and the difficulty of learning from sparse rewards on long-horizon tasks, especially for high-degree-of-freedom (DoF) systems. We present a recipe that combines the benefits of BC and RL through a residual learning framework. Our approach uses BC policies as black-box bases and learns lightweight per-step residual corrections via sample-efficient off-policy RL. We demonstrate that the method requires only sparse binary reward signals and can effectively improve manipulation policies on high-DoF systems in both simulation and the real world. In particular, we report what is, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands. Our results show state-of-the-art performance on various vision-based tasks and point towards a practical pathway for deploying RL in the real world. Project website: https://residual-offpolicy-rl.github.io
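The residual idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the additive composition, the `residual_scale` factor, the clipping bounds, the choice to feed the base action into the residual policy, and every function name here are assumptions made for illustration. The abstract states only that a black-box BC base is corrected by a lightweight per-step residual trained from sparse binary rewards.

```python
from typing import Callable, List, Sequence

Action = List[float]


def residual_action(
    base_policy: Callable[[Sequence[float]], Action],
    residual_policy: Callable[[Sequence[float], Action], Action],
    observation: Sequence[float],
    residual_scale: float = 0.1,  # assumed: keeps corrections small
    action_low: float = -1.0,     # assumed action bounds
    action_high: float = 1.0,
) -> Action:
    """Add a small learned correction to a frozen base action."""
    # The BC policy is treated as a black box: it is queried, never updated.
    base = base_policy(observation)
    # The residual policy proposes a per-step correction (hypothetical signature).
    delta = residual_policy(observation, base)
    combined = [b + residual_scale * d for b, d in zip(base, delta)]
    # Clip each component to the assumed valid action range.
    return [min(action_high, max(action_low, a)) for a in combined]


def sparse_binary_reward(task_succeeded: bool) -> float:
    """Return 1.0 on task success and 0.0 otherwise."""
    return 1.0 if task_succeeded else 0.0


# Toy stand-ins for the two policies, for illustration only.
def toy_base(obs: Sequence[float]) -> Action:
    return [0.5 * x for x in obs]


def toy_residual(obs: Sequence[float], base: Action) -> Action:
    return [1.0 for _ in base]


action = residual_action(toy_base, toy_residual, [1.0, -1.0, 4.0])
```

In a full system, the residual policy would be the only component updated by the off-policy RL algorithm, while the base policy stays fixed. The small scale factor is one common way to keep early exploration close to the base behavior.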
