Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning
October 2, 2025
Authors: Ahmed Hendawy, Henrik Metternich, Théo Vincent, Mahdi Kallel, Jan Peters, Carlo D'Eramo
cs.AI
Abstract
The use of target networks is a popular approach for estimating value
functions in deep Reinforcement Learning (RL). While effective, the target
network remains a compromise solution that preserves stability at the cost of
slowly moving targets, thus delaying learning. Conversely, using the online
network as a bootstrapped target is intuitively appealing, albeit well-known to
lead to unstable learning. In this work, we aim to obtain the best of both
worlds by introducing a novel update rule that computes the target using the
MINimum estimate between the Target and Online network, giving rise to our
method, MINTO. Through this simple yet effective modification, we show that
MINTO enables faster and more stable value function learning by mitigating the
potential overestimation bias of using the online network for bootstrapping.
Notably, MINTO can be seamlessly integrated into a wide range of value-based
and actor-critic algorithms at negligible cost. We evaluate MINTO
extensively across diverse benchmarks, spanning online and offline RL, as well
as discrete and continuous action spaces. Across all benchmarks, MINTO
consistently improves performance, demonstrating its broad applicability and
effectiveness.
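
To make the update rule concrete, below is a minimal sketch of how the description above could be instantiated in a DQN-style setting, with the bootstrap value taken from the minimum of the target and online network estimates. The function name, network interfaces, and hyperparameters here are illustrative assumptions, not the paper's implementation.

```python
import torch

@torch.no_grad()
def minto_td_target(online_net, target_net, reward, next_obs, done, gamma=0.99):
    """Sketch of a MINTO-style TD target (assumed DQN-like setup, not the authors' code)."""
    # Q-value estimates for the next observation from both networks.
    q_online = online_net(next_obs)    # shape: [batch, num_actions]
    q_target = target_net(next_obs)    # shape: [batch, num_actions]
    # Take the minimum of the two estimates (one plausible reading of the
    # abstract), then bootstrap greedily over actions.
    next_value = torch.min(q_online, q_target).max(dim=1).values
    # Standard one-step TD target, masking out terminal transitions.
    return reward + gamma * (1.0 - done) * next_value
```

The resulting target can be plugged into the usual regression loss against the online network's Q-values for the taken actions, which is why the modification adds essentially no overhead to value-based or actor-critic training loops.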