SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning
October 13, 2024
Authors: Hojoon Lee, Dongyoon Hwang, Donghu Kim, Hyunseung Kim, Jun Jet Tai, Kaushik Subramanian, Peter R. Wurman, Jaegul Choo, Peter Stone, Takuma Seno
cs.AI
Abstract
Recent advances in CV and NLP have been largely driven by scaling up the
number of network parameters, despite traditional theories suggesting that
larger networks are prone to overfitting. These large networks avoid
overfitting by integrating components that induce a simplicity bias, guiding
models toward simple and generalizable solutions. However, in deep RL,
designing and scaling up networks have been less explored. Motivated by this
opportunity, we present SimBa, an architecture designed to scale up parameters
in deep RL by injecting a simplicity bias. SimBa consists of three components:
(i) an observation normalization layer that standardizes inputs with running
statistics, (ii) a residual feedforward block to provide a linear pathway from
the input to output, and (iii) a layer normalization to control feature
magnitudes. By scaling up parameters with SimBa, the sample efficiency of
various deep RL algorithms (including off-policy, on-policy, and unsupervised
methods) is consistently improved. Moreover, integrating the SimBa
architecture into SAC alone matches or surpasses state-of-the-art deep RL
methods with high computational efficiency across DMC, MyoSuite, and HumanoidBench.
These results demonstrate SimBa's broad applicability and effectiveness across
diverse RL algorithms and environments.
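The three components named in the abstract (observation normalization with running statistics, a residual feedforward block with a linear input-to-output pathway, and layer normalization on features) can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the paper's reference implementation: the class and function names (`RunningNorm`, `residual_block`), the hidden-width expansion, and the exact placement of the layer norms are choices made here for clarity.

```python
import numpy as np

class RunningNorm:
    """Observation normalization using running mean/variance statistics."""
    def __init__(self, dim):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = 1e-4  # small prior count avoids division by zero early on

    def update(self, x):
        # Parallel (Welford-style) update from a batch of observations.
        batch_mean = x.mean(axis=0)
        batch_var = x.var(axis=0)
        n = x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + n
        self.mean = self.mean + delta * n / total
        self.var = (self.var * self.count + batch_var * n
                    + delta ** 2 * self.count * n / total) / total
        self.count = total

    def __call__(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)

def layer_norm(x, eps=1e-6):
    """Normalize features along the last axis to control their magnitude."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def residual_block(x, W1, b1, W2, b2):
    """Residual feedforward block: x + MLP(LayerNorm(x)).

    The skip connection gives a linear pathway from input to output,
    which is the simplicity-bias ingredient the abstract describes.
    """
    h = layer_norm(x)
    h = np.maximum(h @ W1 + b1, 0.0)  # ReLU
    return x + h @ W2 + b2
```

A forward pass would then chain `RunningNorm` on raw observations, one or more `residual_block` calls, and a final `layer_norm` before the policy or value head; stacking more blocks is how the parameter count is scaled up.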