

Debiased Model-based Representations for Sample-efficient Continuous Control

May 12, 2026
作者: Jiafei Lyu, Zichuan Lin, Scott Fujimoto, Kai Yang, Yangkun Chen, Saiyong Yang, Zongqing Lu, Deheng Ye
cs.AI

Abstract

Model-based representations have recently emerged as a promising framework that embeds latent dynamics information into the representations used for downstream off-policy actor-critic learning. This approach implicitly combines the advantages of model-free and model-based methods while avoiding the training costs associated with the latter. Nevertheless, existing model-based representation methods can fail to capture sufficient information about relevant variables and can overfit to early experiences in the replay buffer. Both issues bias representation and actor-critic learning, leading to inferior performance. To address this, we propose Debiased model-based Representations for Q-learning (DR.Q). Beyond minimizing the deviation between the representations of the current state-action pair and the next state, DR.Q explicitly maximizes the mutual information between them, and it samples transitions with faded prioritized experience replay. We evaluate DR.Q on numerous continuous control benchmarks with a single set of hyperparameters, and the results demonstrate that DR.Q matches or surpasses recent strong baselines, sometimes outperforming them by a large margin. Our code is available at https://github.com/dmksjfl/DR.Q.
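The abstract pairs a latent-consistency (deviation) term with an explicit mutual-information term over the representations of the current state-action pair and the next state. Below is a minimal PyTorch sketch of such a combined objective, using an InfoNCE-style lower bound with in-batch negatives for the MI term; the function and parameter names (`representation_loss`, `W`, `mi_weight`) and the bilinear contrastive score are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of the two representation objectives the abstract names:
# (1) minimizing the deviation between the representation of the current
# (state, action) pair and that of the next state, and (2) explicitly
# maximizing a lower bound on the mutual information between them.
import torch
import torch.nn.functional as F

def representation_loss(phi_sa, phi_next, W, mi_weight=1.0):
    """phi_sa:   (B, d) representations of (state, action) pairs
       phi_next: (B, d) representations of the corresponding next states
       W:        (d, d) learnable bilinear matrix for the contrastive score"""
    # (1) Deviation term: pull each phi_sa toward its own next-state code.
    deviation = F.mse_loss(phi_sa, phi_next.detach())

    # (2) MI term: InfoNCE lower bound on I(phi_sa; phi_next); the other
    # next states in the batch serve as negatives for every anchor.
    logits = phi_sa @ W @ phi_next.t()                      # (B, B) scores
    labels = torch.arange(phi_sa.size(0), device=phi_sa.device)
    info_nce = F.cross_entropy(logits, labels)              # minimize = maximize MI bound

    return deviation + mi_weight * info_nce
```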
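Faded prioritized experience replay is likewise described only at a high level: the idea is to damp the sampling priority of older transitions so the learner does not overfit to early replay-buffer experiences. A toy sketch under the assumption of an exponential age-based fading schedule follows; the `fade_rate` parameter and the schedule itself are hypothetical, and the paper's exact scheme may differ.

```python
# A toy sketch of "faded" prioritized sampling: each transition's priority is
# damped by an age-dependent factor, so early experiences are revisited less
# often as training progresses. The exponential schedule is an assumption.
import numpy as np

def faded_sample(priorities, insert_steps, current_step,
                 batch_size, fade_rate=1e-5, alpha=0.6):
    """priorities:   (N,) TD-error-based priorities
       insert_steps: (N,) environment step at which each transition was stored
       current_step: current environment step"""
    age = current_step - insert_steps
    faded = (priorities ** alpha) * np.exp(-fade_rate * age)  # damp old items
    probs = faded / faded.sum()
    return np.random.choice(len(priorities), size=batch_size, p=probs)
```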