Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning

Abstract

Target networks are a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise: it preserves stability at the cost of slowly moving targets, which delays learning. Conversely, using the online network as the bootstrapped target is intuitively appealing, but is well known to lead to unstable learning. In this work, we aim to obtain the best of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online network, giving rise to our method, MINTO. Through this simple yet effective modification, we show that MINTO enables faster and more stable value function learning by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms at negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.
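The update rule described above can be illustrated with a minimal sketch for a DQN-style target with discrete actions. The abstract only states that the target uses the minimum estimate between the target and online networks; the details here are assumptions made for illustration: the function name `minto_style_target`, the argument names, and the choice to take the minimum per action before maximizing over actions are all hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np


def minto_style_target(rewards, dones, q_next_online, q_next_target, gamma=0.99):
    """Sketch of a bootstrapped target using the min of online and target estimates.

    Assumed shapes (hypothetical, for illustration only):
      rewards, dones:               (batch,)
      q_next_online, q_next_target: (batch, num_actions), Q-values at the next state
    """
    # Elementwise minimum of the two next-state estimates, per action.
    # Assumption: the min is applied before the max over actions.
    q_min = np.minimum(q_next_online, q_next_target)
    # Greedy bootstrap value under the pessimistic (min) estimate.
    bootstrap = q_min.max(axis=1)
    # Standard one-step TD target; terminal transitions do not bootstrap.
    return rewards + gamma * (1.0 - dones) * bootstrap


# Tiny hand-made batch: two transitions, two actions.
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
q_next_online = np.array([[2.0, 5.0], [1.0, 3.0]])
q_next_target = np.array([[3.0, 4.0], [2.0, 2.0]])

targets = minto_style_target(rewards, dones, q_next_online, q_next_target, gamma=0.9)
print(targets)
```

Under these assumptions, the resulting target can never exceed the one obtained by bootstrapping from either network alone, which is one way to see how taking the minimum could temper the overestimation that comes with using the online network directly.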
