Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning
October 2, 2025
作者: Ahmed Hendawy, Henrik Metternich, Théo Vincent, Mahdi Kallel, Jan Peters, Carlo D'Eramo
cs.AI
Abstract
The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise solution that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, albeit well known to lead to unstable learning. In this work, we aim to get the best of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online network, giving rise to our method, MINTO. Through this simple yet effective modification, we show that MINTO enables faster and more stable value function learning by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms at negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.
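
To make the update rule concrete, below is a minimal sketch of a MINTO-style TD target for a DQN-like agent. The abstract only specifies that the bootstrap target uses the minimum of the Target and Online network estimates; the PyTorch usage, the function name `minto_td_target`, and the choice of selecting the bootstrap action greedily with the online network are assumptions for illustration, not the paper's exact implementation.

```python
# Hypothetical sketch of a MINTO-style TD target (DQN-like setting).
# Assumptions: PyTorch, discrete actions, greedy bootstrap action chosen
# by the online network. Only the min(target, online) idea comes from
# the abstract; everything else is illustrative.
import torch


def minto_td_target(online_net, target_net, reward, next_obs, done, gamma=0.99):
    """Return r + gamma * min(Q_target(s', a*), Q_online(s', a*)) for non-terminal s'."""
    with torch.no_grad():
        q_online_next = online_net(next_obs)   # shape: [batch, num_actions]
        q_target_next = target_net(next_obs)   # shape: [batch, num_actions]
        # Bootstrap action from the online network (assumption).
        next_action = q_online_next.argmax(dim=1, keepdim=True)
        # MINTO: take the minimum of the online and target estimates
        # for the chosen action to curb overestimation.
        q_next = torch.minimum(
            q_online_next.gather(1, next_action),
            q_target_next.gather(1, next_action),
        ).squeeze(1)
        return reward + gamma * (1.0 - done) * q_next
```

In this sketch, the minimum acts as a pessimistic correction: the fast-moving online estimate speeds up target updates, while clipping it by the slower target-network estimate guards against the overestimation that pure online bootstrapping is known to cause.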