Efficient Diffusion Policies for Offline Reinforcement Learning

May 31, 2023
Authors: Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, Shuicheng Yan
cs.AI

Abstract

Offline reinforcement learning (RL) aims to learn optimal policies from offline datasets, where the parameterization of policies is crucial but often overlooked. Recently, Diffusion-QL significantly boosted the performance of offline RL by representing a policy with a diffusion model, whose success relies on a parameterized Markov chain with hundreds of steps for sampling. However, Diffusion-QL suffers from two critical limitations: 1) it is computationally inefficient to forward and backward through the whole Markov chain during training, and 2) it is incompatible with maximum-likelihood-based RL algorithms (e.g., policy gradient methods) because the likelihood of diffusion models is intractable. Therefore, we propose the efficient diffusion policy (EDP) to overcome these two challenges. EDP approximately constructs actions from corrupted ones during training to avoid running the sampling chain. We conduct extensive experiments on the D4RL benchmark. The results show that EDP reduces diffusion-policy training time from 5 days to 5 hours on gym-locomotion tasks. Moreover, we show that EDP is compatible with various offline RL algorithms (TD3, CRR, and IQL) and achieves a new state of the art on D4RL, outperforming previous methods by large margins. Our code is available at https://github.com/sail-sg/edp.
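
To illustrate the core idea of constructing actions from corrupted ones without running the full reverse sampling chain, here is a minimal sketch. It assumes a DDPM-style noise-prediction policy network and a standard cumulative noise schedule; the function and argument names (`approx_action`, `policy_net`, `alphas_bar`) are hypothetical and not taken from the authors' released code.

```python
import torch

def approx_action(policy_net, state, action, alphas_bar, num_timesteps=100):
    """One-step action approximation (a sketch of the idea, not the paper's exact implementation).

    Instead of running the full reverse diffusion chain during training,
    the dataset action is noised to a random timestep and an approximate
    clean action is recovered in a single denoising step, so a Q-based
    policy loss can backpropagate through one network call.
    """
    batch = action.shape[0]
    # Sample a diffusion timestep for each example in the batch.
    t = torch.randint(0, num_timesteps, (batch,), device=action.device)
    a_bar = alphas_bar[t].unsqueeze(-1)          # cumulative product of (1 - beta_t)
    noise = torch.randn_like(action)
    # Forward (noising) process: a_t = sqrt(a_bar) * a_0 + sqrt(1 - a_bar) * eps
    noisy_action = a_bar.sqrt() * action + (1.0 - a_bar).sqrt() * noise
    # The policy network predicts the injected noise, DDPM-style.
    eps_pred = policy_net(state, noisy_action, t)
    # Invert the noising step to get an approximate clean action in one shot.
    return (noisy_action - (1.0 - a_bar).sqrt() * eps_pred) / a_bar.sqrt()
```

The approximate action returned here can be fed to a critic (e.g., the TD3-, CRR-, or IQL-style objectives mentioned above) in place of an action sampled from the hundreds-of-step Markov chain, which is what makes training substantially cheaper under these assumptions.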