

Efficient Diffusion Policies for Offline Reinforcement Learning

May 31, 2023
Authors: Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, Shuicheng Yan
cs.AI

Abstract

Offline reinforcement learning (RL) aims to learn optimal policies from offline datasets, where the parameterization of policies is crucial but often overlooked. Recently, Diffusion-QL has significantly boosted the performance of offline RL by representing a policy with a diffusion model, whose success relies on a parameterized Markov chain with hundreds of steps for sampling. However, Diffusion-QL suffers from two critical limitations. 1) It is computationally inefficient to run forward and backward passes through the whole Markov chain during training. 2) It is incompatible with maximum likelihood-based RL algorithms (e.g., policy gradient methods), as the likelihood of diffusion models is intractable. Therefore, we propose the efficient diffusion policy (EDP) to overcome these two challenges. EDP approximately constructs actions from corrupted ones during training to avoid running the sampling chain. We conduct extensive experiments on the D4RL benchmark. The results show that EDP reduces diffusion-policy training time from 5 days to 5 hours on gym-locomotion tasks. Moreover, we show that EDP is compatible with various offline RL algorithms (TD3, CRR, and IQL) and sets a new state of the art on D4RL, outperforming previous methods by large margins. Our code is available at https://github.com/sail-sg/edp.
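The action-approximation idea described in the abstract can be sketched in a few lines of PyTorch: corrupt a dataset action with forward-diffusion noise and invert that step in closed form from the denoiser's noise prediction, instead of unrolling the full reverse sampling chain at every training step. This is a minimal illustration under a standard DDPM noise-prediction parameterization; the names `denoiser` and `alpha_bar` are placeholders and not the repository's actual API.

```python
import torch

def approx_action(denoiser, state, action, alpha_bar, t=None):
    """One-step action approximation (illustrative sketch).

    denoiser:  callable eps_theta(a_t, state, t) predicting the added noise
    alpha_bar: tensor of cumulative noise-schedule products, shape (T,)
    action:    clean dataset actions a_0, shape (batch, action_dim)
    """
    batch, T = action.shape[0], alpha_bar.shape[0]
    if t is None:
        # Sample a random diffusion timestep per example.
        t = torch.randint(0, T, (batch,), device=action.device)
    noise = torch.randn_like(action)
    ab = alpha_bar[t].unsqueeze(-1)                       # \bar{alpha}_t
    # Forward diffusion: a_t = sqrt(ab) * a_0 + sqrt(1 - ab) * eps
    a_t = ab.sqrt() * action + (1.0 - ab).sqrt() * noise
    # Invert the forward process in one step using the predicted noise:
    # a0_hat = (a_t - sqrt(1 - ab) * eps_theta) / sqrt(ab)
    eps_pred = denoiser(a_t, state, t)
    a0_hat = (a_t - (1.0 - ab).sqrt() * eps_pred) / ab.sqrt()
    return a0_hat.clamp(-1.0, 1.0)
```

In this sketch, `a0_hat` would then be fed to a Q-network for the policy-improvement term (e.g., maximizing Q(state, a0_hat)), so training never has to backpropagate through the hundreds-of-steps sampling chain.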