SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning

Abstract

Recent advances in computer vision (CV) and natural language processing (NLP) have been driven largely by scaling up the number of network parameters, even though traditional theory suggests that larger networks are prone to overfitting. These large networks avoid overfitting by integrating components that induce a simplicity bias, guiding models toward simple and generalizable solutions. In deep reinforcement learning (RL), however, network design and scaling have been less explored. Motivated by this opportunity, we present SimBa, an architecture designed to scale up parameters in deep RL by injecting a simplicity bias. SimBa consists of three components: (i) an observation normalization layer that standardizes inputs with running statistics, (ii) a residual feedforward block that provides a linear pathway from the input to the output, and (iii) a layer normalization that controls feature magnitudes. Scaling up parameters with SimBa consistently improves the sample efficiency of various deep RL algorithms, including off-policy, on-policy, and unsupervised methods. Moreover, simply integrating the SimBa architecture into SAC matches or surpasses state-of-the-art deep RL methods, with high computational efficiency, across DMC, MyoSuite, and HumanoidBench. These results demonstrate SimBa's broad applicability and effectiveness across diverse RL algorithms and environments.
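The three components listed in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the ordering of operations inside the block, the ReLU activation, the omission of biases and learned layer-norm parameters, and all names and sizes (`RunningNorm`, `residual_block`, `encode`, `hidden`, `expand`) are assumptions chosen only to make the idea concrete.

```python
import math
import random


class RunningNorm:
    """Standardizes observations with a running mean/variance (Welford's update)."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations per dimension
        self.eps = eps

    def update(self, obs):
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def __call__(self, obs):
        n = max(self.count, 1)
        return [(x - m) / math.sqrt(s / n + self.eps)
                for x, m, s in zip(obs, self.mean, self.m2)]


def layer_norm(x, eps=1e-5):
    """Rescales a feature vector to zero mean and unit variance (no learned gain or bias)."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def linear(x, weight):
    """Matrix-vector product; weight is a list of rows (bias omitted for brevity)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def init_weight(rng, out_dim, in_dim):
    """Small uniform random weights; an arbitrary illustrative initializer."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def residual_block(x, w_in, w_out):
    """Computes x + MLP(LayerNorm(x)); the skip connection is the linear input-to-output path."""
    h = linear(layer_norm(x), w_in)
    h = [max(0.0, v) for v in h]  # ReLU (assumed activation)
    h = linear(h, w_out)
    return [a + b for a, b in zip(x, h)]


def encode(obs, norm, w_embed, blocks):
    """Observation normalization -> linear embedding -> residual blocks -> layer norm."""
    x = linear(norm(obs), w_embed)
    for w_in, w_out in blocks:
        x = residual_block(x, w_in, w_out)
    return layer_norm(x)


rng = random.Random(0)
obs_dim, hidden, expand = 3, 8, 4  # hypothetical sizes
norm = RunningNorm(obs_dim)
for _ in range(100):  # accumulate running statistics from synthetic observations
    norm.update([rng.gauss(5.0, 2.0) for _ in range(obs_dim)])

w_embed = init_weight(rng, hidden, obs_dim)
blocks = [(init_weight(rng, hidden * expand, hidden),
           init_weight(rng, hidden, hidden * expand)) for _ in range(2)]
features = encode([5.0, 4.0, 6.0], norm, w_embed, blocks)
```

In this sketch, adding more entries to `blocks` or enlarging `hidden` increases the parameter count, while the skip connection in `residual_block` keeps an unmodified path from each block's input to its output and the final `layer_norm` bounds the scale of the returned features.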
