Diffusion Policy Policy Optimization

Abstract

We introduce Diffusion Policy Policy Optimization (DPPO), an algorithmic framework, including best practices, for fine-tuning diffusion-based policies (e.g., Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks, compared both to other RL methods for diffusion-based policies and to PG fine-tuning of other policy parameterizations. Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness. We further demonstrate the strengths of DPPO in a range of realistic settings, including simulated robotic tasks with pixel observations, and via zero-shot deployment of simulation-trained policies on robot hardware in a long-horizon, multi-stage manipulation task. Website with code: diffusion-ppo.github.io