Diffusion Policy Policy Optimization

September 1, 2024
Authors: Allen Z. Ren, Justin Lidard, Lars L. Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, Max Simchowitz
cs.AI

Abstract

We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations. Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness. We further demonstrate the strengths of DPPO in a range of realistic settings, including simulated robotic tasks with pixel observations, and via zero-shot deployment of simulation-trained policies on robot hardware in a long-horizon, multi-stage manipulation task. Website with code: diffusion-ppo.github.io
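To make the core recipe concrete, below is a minimal, hypothetical PyTorch sketch of the idea the abstract describes: each reverse (denoising) step of a diffusion policy is a Gaussian "action" with a computable log-probability, so a PPO-style clipped policy gradient can be applied along the denoising chain. All names here (NoiseNet, rollout_chain, dppo_style_loss), the network sizes, and the choice to broadcast a single environment-level advantage across denoising steps are assumptions for illustration, not the authors' implementation; the full DPPO framework and its best practices are in the paper and at diffusion-ppo.github.io.

```python
# Hypothetical sketch only: a toy diffusion policy whose reverse (denoising)
# steps are Gaussian, fine-tuned with a PPO-style clipped surrogate.
import torch
import torch.nn as nn


class NoiseNet(nn.Module):
    """Tiny MLP predicting the mean of each denoising step from the observation,
    the current noisy action, and the step index (a stand-in for a diffusion policy)."""

    def __init__(self, obs_dim, act_dim, hidden=128, n_steps=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )
        self.n_steps = n_steps
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # learned per-dim std

    def step_dist(self, obs, a_t, t):
        """Gaussian over the next (less noisy) action: the per-step 'policy'."""
        t_feat = torch.full_like(a_t[:, :1], float(t) / self.n_steps)
        mean = self.net(torch.cat([obs, a_t, t_feat], dim=-1))
        return torch.distributions.Normal(mean, self.log_std.exp())


def rollout_chain(policy, obs):
    """Run the denoising chain from pure noise; record every (state, sample) pair."""
    a = torch.randn(obs.shape[0], policy.log_std.shape[0])
    states, samples = [], []
    for t in reversed(range(policy.n_steps)):
        dist = policy.step_dist(obs, a, t)
        a_next = dist.sample()
        states.append(a)
        samples.append(a_next)
        a = a_next
    return a, states, samples  # final action for the environment, plus the chain


def chain_logps(policy, obs, states, samples):
    """Log-probability of each recorded denoising step under the current policy."""
    logps = []
    for i, t in enumerate(reversed(range(policy.n_steps))):
        dist = policy.step_dist(obs, states[i], t)
        logps.append(dist.log_prob(samples[i]).sum(-1))
    return torch.stack(logps, dim=1)  # shape (batch, n_steps)


def dppo_style_loss(new_logps, old_logps, adv, clip=0.2):
    """PPO clipped surrogate where every denoising step is treated as an action and
    the environment-level advantage is shared across the chain (a simplification)."""
    ratio = (new_logps - old_logps).exp()
    adv = adv.unsqueeze(1)  # broadcast the advantage over denoising steps
    surrogate = torch.minimum(ratio * adv, ratio.clamp(1 - clip, 1 + clip) * adv)
    return -surrogate.mean()


if __name__ == "__main__":
    # Toy usage: one gradient step on random data. Real use would collect rollouts
    # from a control/robotics environment and estimate advantages from returns.
    obs_dim, act_dim, batch = 8, 2, 16
    policy = NoiseNet(obs_dim, act_dim)
    opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
    obs = torch.randn(batch, obs_dim)
    with torch.no_grad():
        _, states, samples = rollout_chain(policy, obs)
        old_logps = chain_logps(policy, obs, states, samples)
    adv = torch.randn(batch)  # placeholder advantages from environment rollouts
    loss = dppo_style_loss(chain_logps(policy, obs, states, samples), old_logps, adv)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

This compresses the idea into a single surrogate over the denoising chain; the actual DPPO framework layers the additional fine-tuning best practices described in the paper on top of this basic recipe.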
