Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning

Abstract

The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise solution that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, but it is well known to lead to unstable learning. In this work, we aim to obtain the best of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online network, giving rise to our method, MINTO. Through this simple yet effective modification, we show that MINTO enables faster and more stable value function learning by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms at negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.
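The update rule described in the abstract can be illustrated with a minimal sketch for a discrete-action, Q-learning-style target. The abstract does not spell out the exact formula, so the details here are assumptions: the function name `minto_style_target` and its parameters are hypothetical, and the minimum is assumed to be taken per action between the online and target estimates before maximizing over actions.

```python
from typing import Sequence


def minto_style_target(
    reward: float,
    done: bool,
    q_online_next: Sequence[float],
    q_target_next: Sequence[float],
    gamma: float = 0.99,
) -> float:
    """Bootstrapped target using the minimum of online and target estimates.

    Assumption (not confirmed by the abstract): the minimum is taken
    element-wise per action, and the greedy max is applied afterwards.
    """
    if not q_online_next or len(q_online_next) != len(q_target_next):
        raise ValueError("Q-value lists must be non-empty and of equal length")

    # Per-action pessimistic estimate: the smaller of the two networks' values.
    q_min = [min(o, t) for o, t in zip(q_online_next, q_target_next)]

    # No bootstrapping from terminal states.
    bootstrap = 0.0 if done else max(q_min)
    return reward + gamma * bootstrap


# Online network overestimates action 1; the target network does not.
q_online = [2.0, 5.0]
q_target = [3.0, 1.0]
print(minto_style_target(1.0, False, q_online, q_target, gamma=0.5))  # 2.0
```

In this toy case, bootstrapping from the online network alone would give 1.0 + 0.5 * 5.0 = 3.5 and from the target network alone 1.0 + 0.5 * 3.0 = 2.5, while the minimum-based target is 2.0, which shows how the rule damps an overestimated action value.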
