Debiased Model-based Representations for Sample-efficient Continuous Control

Abstract

Model-based representations have recently emerged as a promising framework for embedding latent dynamics information into the representations used by downstream off-policy actor-critic learning. They implicitly combine the advantages of model-free and model-based approaches while avoiding the training costs associated with model-based methods. Nevertheless, existing model-based representation methods can fail to capture sufficient information about relevant variables and can overfit to early experiences in the replay buffer. Both failures introduce bias into representation learning and actor-critic learning, leading to inferior performance. To address this, we propose Debiased model-based Representations for Q-learning, the DR.Q algorithm. In addition to minimizing the deviation between the representations of the current state-action pair and the next state, DR.Q explicitly maximizes their mutual information, and it samples transitions with faded prioritized experience replay. We evaluate DR.Q on numerous continuous control benchmarks with a single set of hyperparameters. The results demonstrate that DR.Q can match or surpass recent strong baselines, sometimes outperforming them by a large margin. Our code is available at https://github.com/dmksjfl/DR.Q.
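To make the representation objective concrete, the following is a minimal sketch of one way to pair a deviation term with a mutual-information term. It is not taken from the DR.Q code. It assumes an InfoNCE-style contrastive loss, a common lower-bound estimator of mutual information, as a stand-in for the paper's estimator. The function name `representation_loss` and the parameters `temperature` and `mi_weight` are hypothetical.

```python
import numpy as np


def representation_loss(pred_next, target_next, temperature=0.1, mi_weight=1.0):
    """Illustrative loss, NOT the DR.Q implementation.

    pred_next:   (B, D) representation predicted from the current state-action pair.
    target_next: (B, D) representation of the observed next state.
    Returns (total, deviation, info_nce).
    """
    # Deviation term: mean squared error between matched pairs.
    deviation = np.mean((pred_next - target_next) ** 2)

    # Assumed mutual-information term (InfoNCE): within the batch, each
    # prediction should score its own next-state representation highest.
    eps = 1e-8  # guards against division by zero for all-zero rows
    p = pred_next / (np.linalg.norm(pred_next, axis=1, keepdims=True) + eps)
    t = target_next / (np.linalg.norm(target_next, axis=1, keepdims=True) + eps)
    logits = (p @ t.T) / temperature                      # (B, B) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    info_nce = -np.mean(np.diag(log_softmax))

    # Minimizing info_nce raises a lower bound on the mutual information.
    total = deviation + mi_weight * info_nce
    return total, deviation, info_nce


rng = np.random.default_rng(0)
target = rng.normal(size=(8, 4))
aligned = target + 0.01 * rng.normal(size=(8, 4))   # predictions close to their targets
mismatched = np.roll(aligned, 1, axis=0)            # predictions paired with wrong targets

loss_aligned = representation_loss(aligned, target)
loss_mismatched = representation_loss(mismatched, target)
```

In this sketch the deviation term alone only pulls matched pairs together, while the contrastive term also pushes each prediction away from the other next-state representations in the batch. The faded prioritized experience replay component is not covered here.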
