Robust Layerwise Scaling Rules by Proper Weight Decay Tuning
October 17, 2025
Authors: Zhiyuan Fan, Yifeng Liu, Qingyue Zhao, Angela Yuan, Quanquan Gu
cs.AI
Abstract
Empirical scaling laws prescribe how to allocate parameters, data, and
compute, while maximal-update parameterization (muP) enables learning-rate
transfer across widths by equalizing early-time update magnitudes. However, in
modern scale-invariant architectures, training quickly enters an
optimizer-governed steady state where normalization layers create backward
scale sensitivity and the effective learning rate becomes width dependent,
degrading muP transfer. We address this by introducing a weight-decay
scaling rule for AdamW that preserves sublayer gain across widths. Empirically,
the singular-value spectrum of each matrix parameter scales in norm as
eta/lambda with an approximately invariant shape; under width
scaling d, we observe that the top singular value scales approximately as
eta/lambda · d^{0.75}. Combining this observation with the muP
learning-rate rule eta_2 ∝ d^{-1} for matrix-like parameters implies an
empirical weight-decay scaling rule lambda_2 ∝ d that
approximately keeps sublayer gains width invariant. Together with vector-like
parameters trained at eta_1 = Theta_d(1) and lambda_1 = 0, this yields
zero-shot transfer of both learning rate and weight decay from proxy to
target widths, removing per-width sweeps. We validate the rule on LLaMA-style
Transformers and in a minimal synthetic setting, and we provide a simple
diagnostic, matching top singular values, to check sublayer-gain invariance.
Our results extend muP beyond the near-init regime by explicitly controlling
steady-state scales set by the optimizer, offering a practical recipe for
width-robust hyperparameter transfer under AdamW.
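
As a concrete illustration of the transfer recipe stated in the abstract, the sketch below builds an AdamW optimizer whose per-group hyperparameters follow the rules eta_2 ∝ d^{-1} and lambda_2 ∝ d for matrix-like parameters, and eta_1 = Theta_d(1) with lambda_1 = 0 for vector-like parameters. It is a minimal sketch, not the paper's implementation: the split by tensor dimensionality and the names scaled_adamw, proxy_width, target_width, proxy_lr_matrix, proxy_wd_matrix, and lr_vector are assumptions made here for illustration.

# Minimal sketch of zero-shot lr/weight-decay transfer across widths (assumed interface).
import torch

def scaled_adamw(model, proxy_width, target_width,
                 proxy_lr_matrix, proxy_wd_matrix, lr_vector):
    """Build AdamW with width-scaled hyperparameters (illustrative only)."""
    ratio = target_width / proxy_width
    # Assumption: vector-like = 1-D tensors (norm gains, biases), matrix-like = 2-D+ tensors.
    vector_params = [p for p in model.parameters() if p.ndim <= 1]
    matrix_params = [p for p in model.parameters() if p.ndim >= 2]
    groups = [
        # Vector-like: width-independent learning rate, no weight decay (eta_1 = Theta_d(1), lambda_1 = 0).
        {"params": vector_params, "lr": lr_vector, "weight_decay": 0.0},
        # Matrix-like: learning rate shrinks as 1/d, weight decay grows as d.
        {"params": matrix_params,
         "lr": proxy_lr_matrix / ratio,
         "weight_decay": proxy_wd_matrix * ratio},
    ]
    return torch.optim.AdamW(groups)

Under this rule, hyperparameters tuned once at the proxy width are reused verbatim at the target width; only the deterministic rescaling by ratio changes, which is what removes the per-width sweeps mentioned above.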
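The diagnostic mentioned in the abstract, matching top singular values to check sublayer-gain invariance, could be probed along the lines sketched below. The pairing of parameters by name, the relative-gap threshold tol, and the helper names are assumptions for illustration; the paper's exact comparison and normalization may differ.

# Sketch of a top-singular-value comparison across proxy- and target-width models (assumed setup).
import torch

@torch.no_grad()
def top_singular_values(model):
    """Top singular value of every matrix-like parameter, keyed by parameter name."""
    return {
        name: torch.linalg.matrix_norm(p.detach().float(), ord=2).item()
        for name, p in model.named_parameters()
        if p.ndim >= 2
    }

def compare_sublayer_gains(proxy_model, target_model, tol=0.2):
    """Report sublayers whose top singular values drift noticeably across widths."""
    proxy_sv = top_singular_values(proxy_model)
    target_sv = top_singular_values(target_model)
    for name, sv_proxy in proxy_sv.items():
        sv_target = target_sv.get(name)
        if sv_target is None:
            continue  # Assumes corresponding parameters share names across widths.
        rel_gap = abs(sv_target - sv_proxy) / max(sv_proxy, 1e-12)
        if rel_gap > tol:
            print(f"{name}: proxy {sv_proxy:.3f} vs target {sv_target:.3f}")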