Debiased Model-based Representations for Sample-efficient Continuous Control
May 12, 2026
Authors: Jiafei Lyu, Zichuan Lin, Scott Fujimoto, Kai Yang, Yangkun Chen, Saiyong Yang, Zongqing Lu, Deheng Ye
cs.AI
Abstract
Model-based representations have recently emerged as a promising framework that embeds latent dynamics information into representations for downstream off-policy actor-critic learning. This approach implicitly combines the advantages of model-free and model-based methods while avoiding the training costs associated with model-based methods. Nevertheless, existing model-based representation methods can fail to capture sufficient information about relevant variables and can overfit to early experiences in the replay buffer. These issues bias both the representations and actor-critic learning, leading to inferior performance. To address this, we propose Debiased model-based Representations for Q-learning, dubbed DR.Q. DR.Q explicitly maximizes the mutual information between the representation of the current state-action pair and that of the next state while minimizing their deviation, and it samples transitions with faded prioritized experience replay. We evaluate DR.Q on numerous continuous control benchmarks with a single set of hyperparameters; the results demonstrate that DR.Q matches or surpasses recent strong baselines, sometimes outperforming them by a large margin. Our code is available at https://github.com/dmksjfl/DR.Q.
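To make the representation objective concrete, below is a minimal PyTorch sketch of one plausible instantiation: an InfoNCE-style contrastive bound is used to maximize mutual information between the representation of the current state-action pair and that of the next state, combined with an L2 deviation penalty that keeps paired representations close. The encoder architectures, the choice of InfoNCE as the MI estimator, and the names `StateActionEncoder`, `StateEncoder`, and `lambda_dev` are illustrative assumptions, not the paper's exact formulation; consult the repository above for the authors' implementation.

```python
# Hypothetical sketch of the debiased representation objective described in
# the abstract: maximize MI between z_sa = f(s, a) and z_next = g(s') via an
# InfoNCE-style contrastive bound, while minimizing their deviation.
# Encoder widths, the MI estimator, and lambda_dev are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateActionEncoder(nn.Module):
    def __init__(self, state_dim, action_dim, repr_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ELU(),
            nn.Linear(256, repr_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class StateEncoder(nn.Module):
    def __init__(self, state_dim, repr_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ELU(),
            nn.Linear(256, repr_dim),
        )

    def forward(self, state):
        return self.net(state)

def representation_loss(z_sa, z_next, lambda_dev=1.0, temperature=0.1):
    """InfoNCE term (MI maximization) plus an L2 deviation penalty.

    z_sa:   (B, D) representations of current state-action pairs.
    z_next: (B, D) representations of the corresponding next states.
    Other pairs in the batch serve as negatives for the contrastive bound.
    """
    z_sa = F.normalize(z_sa, dim=-1)
    z_next = F.normalize(z_next, dim=-1)
    logits = z_sa @ z_next.t() / temperature           # (B, B) similarity matrix
    labels = torch.arange(z_sa.size(0), device=z_sa.device)
    info_nce = F.cross_entropy(logits, labels)         # negative MI lower bound
    deviation = F.mse_loss(z_sa, z_next)               # keep paired reprs close
    return info_nce + lambda_dev * deviation
```

The abstract's second ingredient, faded prioritized experience replay, is not specified here; one plausible reading is that sampling priorities are discounted with transition age so that stale early experiences are drawn less often, but the exact scheme is defined in the paper rather than in this sketch.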