Efficient Diffusion Policies for Offline Reinforcement Learning

Abstract

Offline reinforcement learning (RL) aims to learn optimal policies from offline datasets, where the parameterization of policies is crucial but often overlooked. Recently, Diffusion-QL significantly boosted the performance of offline RL by representing a policy with a diffusion model, whose success relies on a parameterized Markov chain with hundreds of steps for sampling. However, Diffusion-QL suffers from two critical limitations. 1) It is computationally inefficient to run forward and backward passes through the whole Markov chain during training. 2) It is incompatible with maximum likelihood-based RL algorithms (e.g., policy gradient methods) because the likelihood of diffusion models is intractable. We therefore propose the efficient diffusion policy (EDP) to overcome these two challenges. During training, EDP approximately constructs actions from corrupted ones, which avoids running the sampling chain. We conduct extensive experiments on the D4RL benchmark. The results show that EDP reduces diffusion policy training time from 5 days to 5 hours on gym-locomotion tasks. Moreover, we show that EDP is compatible with various offline RL algorithms (TD3, CRR, and IQL) and achieves a new state of the art on D4RL by large margins over previous methods. Our code is available at https://github.com/sail-sg/edp.
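The abstract does not spell out how an action is "approximately constructed" from a corrupted one. A minimal sketch of one standard way to do this is given below, assuming a DDPM-style forward process a_k = sqrt(alpha_bar_k) * a_0 + sqrt(1 - alpha_bar_k) * eps, which can be inverted in a single step once a noise estimate is available. The function names, the linear noise schedule, and its values are hypothetical and chosen for illustration only; they are not taken from the EDP code base. The true noise stands in for the output of a learned noise-prediction network.

```python
import numpy as np


def make_alpha_bar(num_steps=100, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal coefficients for a linear beta schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def corrupt_action(action, noise, alpha_bar_k):
    """Forward diffusion: jump directly from a clean action to step k."""
    return np.sqrt(alpha_bar_k) * action + np.sqrt(1.0 - alpha_bar_k) * noise


def approximate_action(noisy_action, predicted_noise, alpha_bar_k):
    """One-step estimate of the clean action, with no sampling chain."""
    return (noisy_action - np.sqrt(1.0 - alpha_bar_k) * predicted_noise) / np.sqrt(alpha_bar_k)


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# A batch of 4 dataset actions with 6 dimensions each (arbitrary sizes).
action = rng.uniform(-1.0, 1.0, size=(4, 6))
noise = rng.standard_normal(action.shape)
k = 50  # an arbitrary diffusion step

noisy = corrupt_action(action, noise, alpha_bar[k])

# Stand-in for a trained network: here the "prediction" is the true noise,
# so the reconstruction is exact. A learned predictor would be approximate.
recovered = approximate_action(noisy, noise, alpha_bar[k])
```

With an imperfect noise estimate the reconstruction is only approximate, but it costs one network evaluation instead of a pass through the full chain, which is the kind of saving the abstract describes.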
