Debiased Model-based Representations for Sample-efficient Continuous Control
May 12, 2026
Authors: Jiafei Lyu, Zichuan Lin, Scott Fujimoto, Kai Yang, Yangkun Chen, Saiyong Yang, Zongqing Lu, Deheng Ye
cs.AI
Abstract
Model-based representations have recently emerged as a promising framework that embeds latent dynamics information into representations for downstream off-policy actor-critic learning. This approach implicitly combines the advantages of model-free and model-based methods while avoiding the training costs associated with model-based methods. Nevertheless, existing model-based representation methods can fail to capture sufficient information about relevant variables and can overfit to early experiences in the replay buffer. These issues bias both the representations and actor-critic learning, leading to inferior performance. To address this, we propose Debiased model-based Representations for Q-learning, dubbed DR.Q. DR.Q explicitly maximizes the mutual information between the representation of the current state-action pair and that of the next state while minimizing their deviation, and it samples transitions with faded prioritized experience replay. We evaluate DR.Q on numerous continuous control benchmarks with a single set of hyperparameters; the results demonstrate that DR.Q matches or surpasses recent strong baselines, sometimes outperforming them by a large margin. Our code is available at https://github.com/dmksjfl/DR.Q.
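To make the representation objective concrete, below is a minimal PyTorch sketch of one plausible instantiation: an InfoNCE-style contrastive bound is used to maximize mutual information between the representation of the current state-action pair and that of the next state, combined with an L2 deviation penalty that keeps paired representations close. The encoder architectures, the choice of InfoNCE as the MI estimator, and the names `StateActionEncoder`, `StateEncoder`, and `lambda_dev` are illustrative assumptions, not the paper's exact formulation; consult the repository above for the authors' implementation.

```python
# Hypothetical sketch of the debiased representation objective described in
# the abstract: maximize MI between z_sa = f(s, a) and z_next = g(s') via an
# InfoNCE-style contrastive bound, while minimizing their deviation.
# Encoder widths, the MI estimator, and lambda_dev are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateActionEncoder(nn.Module):
    def __init__(self, state_dim, action_dim, repr_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ELU(),
            nn.Linear(256, repr_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class StateEncoder(nn.Module):
    def __init__(self, state_dim, repr_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ELU(),
            nn.Linear(256, repr_dim),
        )

    def forward(self, state):
        return self.net(state)

def representation_loss(z_sa, z_next, lambda_dev=1.0, temperature=0.1):
    """InfoNCE term (MI maximization) plus an L2 deviation penalty.

    z_sa:   (B, D) representations of current state-action pairs.
    z_next: (B, D) representations of the corresponding next states.
    Other pairs in the batch serve as negatives for the contrastive bound.
    """
    z_sa = F.normalize(z_sa, dim=-1)
    z_next = F.normalize(z_next, dim=-1)
    logits = z_sa @ z_next.t() / temperature           # (B, B) similarity matrix
    labels = torch.arange(z_sa.size(0), device=z_sa.device)
    info_nce = F.cross_entropy(logits, labels)         # negative MI lower bound
    deviation = F.mse_loss(z_sa, z_next)               # keep paired reprs close
    return info_nce + lambda_dev * deviation
```

The abstract's second ingredient, faded prioritized experience replay, is not specified here; one plausible reading is that sampling priorities are discounted with transition age so that stale early experiences are drawn less often, but the exact scheme is defined in the paper rather than in this sketch.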