

Value-Based Deep RL Scales Predictably

February 6, 2025
作者: Oleh Rybkin, Michal Nauman, Preston Fu, Charlie Snell, Pieter Abbeel, Sergey Levine, Aviral Kumar
cs.AI

Abstract

Scaling data and compute is critical to the success of machine learning. However, scaling demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from small-scale runs, without running the large-scale experiment. In this paper, we show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. First, we show that data and compute requirements to attain a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can predict this data requirement when given more compute, and this compute requirement when given more data. Second, we determine the optimal allocation of a total resource budget across data and compute for a given performance and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling behavior is enabled by first estimating predictable relationships between hyperparameters, which are used to manage effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.
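The core extrapolation idea in the abstract can be illustrated with a minimal sketch: measure, at several small UTD ratios, the data needed to hit a fixed performance level, fit a simple functional form, and then predict the data/compute trade-off at larger UTD ratios. The power-law form and all numbers below are illustrative assumptions, not the paper's actual fits or measurements.

```python
import numpy as np

# Hypothetical small-scale measurements: for each UTD ratio sigma,
# the data D (environment steps) needed to reach a fixed performance
# level. Values are made up for illustration only.
utd  = np.array([1.0, 2.0, 4.0, 8.0])
data = np.array([8e5, 5e5, 3.2e5, 2.1e5])  # D(sigma): falls as UTD grows

# Assume a power law D(sigma) ~ a * sigma^b and fit it in log-log space.
b, log_a = np.polyfit(np.log(utd), np.log(data), 1)
a = np.exp(log_a)

def data_needed(sigma):
    """Predicted data requirement at UTD ratio sigma (fitted power law)."""
    return a * sigma ** b

def compute_needed(sigma):
    """Predicted compute requirement, counted as gradient steps:
    C(sigma) = sigma * D(sigma), since UTD is updates per data point."""
    return sigma * data_needed(sigma)

# Extrapolate along the frontier: with more compute spent per data point
# (higher UTD), predict how much less data reaches the same performance.
for sigma in (16.0, 32.0):
    print(f"UTD={sigma:>4.0f}: data~{data_needed(sigma):.2e}, "
          f"compute~{compute_needed(sigma):.2e}")
```

Under this toy fit, raising the UTD ratio trades compute for data: `data_needed` decreases while `compute_needed` increases, which is the Pareto trade-off the paper estimates from small runs and then extrapolates.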
