SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning
October 13, 2024
Authors: Hojoon Lee, Dongyoon Hwang, Donghu Kim, Hyunseung Kim, Jun Jet Tai, Kaushik Subramanian, Peter R. Wurman, Jaegul Choo, Peter Stone, Takuma Seno
cs.AI
Abstract
Recent advances in CV and NLP have been largely driven by scaling up the
number of network parameters, despite traditional theories suggesting that
larger networks are prone to overfitting. These large networks avoid
overfitting by integrating components that induce a simplicity bias, guiding
models toward simple and generalizable solutions. However, in deep RL,
designing and scaling up networks have been less explored. Motivated by this
opportunity, we present SimBa, an architecture designed to scale up parameters
in deep RL by injecting a simplicity bias. SimBa consists of three components:
(i) an observation normalization layer that standardizes inputs with running
statistics, (ii) a residual feedforward block to provide a linear pathway from
the input to output, and (iii) a layer normalization to control feature
magnitudes. By scaling up parameters with SimBa, the sample efficiency of
various deep RL algorithms (including off-policy, on-policy, and unsupervised
methods) is consistently improved. Moreover, integrating the SimBa
architecture into SAC alone matches or surpasses state-of-the-art deep RL
methods with high computational efficiency across DMC, MyoSuite, and HumanoidBench.
These results demonstrate SimBa's broad applicability and effectiveness across
diverse RL algorithms and environments.
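The three components named in the abstract (observation normalization with running statistics, a residual feedforward block with a linear input-to-output pathway, and layer normalization on features) can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the paper's reference implementation: the class and function names (`RunningNorm`, `residual_block`), the hidden-width expansion, and the exact placement of the layer norms are choices made here for clarity.

```python
import numpy as np

class RunningNorm:
    """Observation normalization using running mean/variance statistics."""
    def __init__(self, dim):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = 1e-4  # small prior count avoids division by zero early on

    def update(self, x):
        # Parallel (Welford-style) update from a batch of observations.
        batch_mean = x.mean(axis=0)
        batch_var = x.var(axis=0)
        n = x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + n
        self.mean = self.mean + delta * n / total
        self.var = (self.var * self.count + batch_var * n
                    + delta ** 2 * self.count * n / total) / total
        self.count = total

    def __call__(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)

def layer_norm(x, eps=1e-6):
    """Normalize features along the last axis to control their magnitude."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def residual_block(x, W1, b1, W2, b2):
    """Residual feedforward block: x + MLP(LayerNorm(x)).

    The skip connection gives a linear pathway from input to output,
    which is the simplicity-bias ingredient the abstract describes.
    """
    h = layer_norm(x)
    h = np.maximum(h @ W1 + b1, 0.0)  # ReLU
    return x + h @ W2 + b2
```

A forward pass would then chain `RunningNorm` on raw observations, one or more `residual_block` calls, and a final `layer_norm` before the policy or value head; stacking more blocks is how the parameter count is scaled up.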