Diffusion Policy Policy Optimization
September 1, 2024
Authors: Allen Z. Ren, Justin Lidard, Lars L. Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, Max Simchowitz
cs.AI
Abstract
We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic
framework including best practices for fine-tuning diffusion-based policies
(e.g. Diffusion Policy) in continuous control and robot learning tasks using
the policy gradient (PG) method from reinforcement learning (RL). PG methods
are ubiquitous in training RL policies with other policy parameterizations;
nevertheless, they had been conjectured to be less efficient for
diffusion-based policies. Surprisingly, we show that DPPO achieves the
strongest overall performance and efficiency for fine-tuning in common
benchmarks compared to other RL methods for diffusion-based policies and also
compared to PG fine-tuning of other policy parameterizations. Through
experimental investigation, we find that DPPO takes advantage of unique
synergies between RL fine-tuning and the diffusion parameterization, leading to
structured and on-manifold exploration, stable training, and strong policy
robustness. We further demonstrate the strengths of DPPO in a range of
realistic settings, including simulated robotic tasks with pixel observations,
and via zero-shot deployment of simulation-trained policies on robot hardware
in a long-horizon, multi-stage manipulation task. Website with code:
diffusion-ppo.github.io
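
The abstract describes fine-tuning a diffusion-based policy with policy gradient (PPO-style) updates. The sketch below illustrates one way the core ingredients can look in code: each reverse-diffusion (denoising) step is modeled as a Gaussian with a fixed variance, so per-step log-probabilities are available and a clipped PPO surrogate can be formed over the denoising chain. This is a minimal illustrative sketch, not the authors' implementation; the network architecture, toy variance schedule, dimensions, and hyperparameters are all assumptions.

```python
# Minimal sketch (PyTorch) of PPO-style fine-tuning applied to a diffusion policy.
# All sizes, the variance schedule, and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, K = 8, 2, 5            # observation/action dims, number of denoising steps (assumed)
BETAS = torch.linspace(1e-4, 0.1, K)     # toy DDPM-style variance schedule (assumed)
ALPHAS = 1.0 - BETAS
ALPHA_BARS = torch.cumprod(ALPHAS, dim=0)

class NoisePredictor(nn.Module):
    """Toy diffusion policy: predicts the noise in a noisy action, given obs and step index."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM + 1, 64), nn.ReLU(),
            nn.Linear(64, ACT_DIM),
        )

    def forward(self, obs, noisy_act, k):
        k_feat = k.float().unsqueeze(-1) / K          # normalized step index as an extra input
        return self.net(torch.cat([obs, noisy_act, k_feat], dim=-1))

def denoise_step_dist(policy, obs, a_k, k):
    """Gaussian over the less-noisy action produced at denoising step k (a Python int)."""
    k_tensor = torch.full((obs.shape[0],), k)
    eps_hat = policy(obs, a_k, k_tensor)
    alpha, alpha_bar, beta = ALPHAS[k], ALPHA_BARS[k], BETAS[k]
    mean = (a_k - beta / torch.sqrt(1.0 - alpha_bar) * eps_hat) / torch.sqrt(alpha)
    std = torch.sqrt(beta) * torch.ones_like(mean)    # fixed per-step variance -> tractable log-probs
    return torch.distributions.Normal(mean, std)

@torch.no_grad()
def sample_action(policy, obs):
    """Run the reverse-diffusion chain, recording per-step samples and log-probs for PPO."""
    a_k = torch.randn(obs.shape[0], ACT_DIM)          # start from pure noise
    chain, logps = [None] * K, [None] * K
    for k in reversed(range(K)):
        dist = denoise_step_dist(policy, obs, a_k, k)
        a_prev = dist.sample()
        chain[k] = (a_k, a_prev)
        logps[k] = dist.log_prob(a_prev).sum(-1)
        a_k = a_prev
    return a_k, chain, logps                          # a_k is the final (executed) action

def ppo_denoising_loss(policy, obs, chain, old_logps, advantages, clip=0.2):
    """Clipped PPO surrogate averaged over the stored denoising steps."""
    loss = 0.0
    for k in range(K):
        a_k, a_prev = chain[k]
        new_logp = denoise_step_dist(policy, obs, a_k, k).log_prob(a_prev).sum(-1)
        ratio = torch.exp(new_logp - old_logps[k])
        clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * advantages
        loss = loss - torch.min(ratio * advantages, clipped).mean()
    return loss / K

# Smoke test with random data; environment interaction and advantage estimation are omitted.
policy = NoisePredictor()
obs = torch.randn(16, OBS_DIM)
action, chain, old_logps = sample_action(policy, obs)
advantages = torch.randn(16)                          # placeholder advantages (assumed)
loss = ppo_denoising_loss(policy, obs, chain, old_logps, advantages)
loss.backward()
```

Everything beyond what the abstract states, such as rollout collection, advantage estimation, and the paper's specific treatment of the denoising steps within the environment MDP, is omitted or stubbed out here.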