Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
January 8, 2026
Authors: Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid
cs.AI
Abstract
Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers that set the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to the data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both the Adam and Muon optimizers, where they yield downstream-evaluation improvements comparable to the gain obtained by switching from Adam to Muon.
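To make the construction concrete, below is a minimal, hypothetical PyTorch sketch of a linear layer carrying a learnable scalar multiplier together with learnable per-row and per-column multipliers. The class name, initialization, and the choice to exclude the multipliers from weight decay are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: a linear layer whose weight W is rescaled by a learnable
# scalar alpha and learnable per-row / per-column multipliers, so the effective
# weight is diag(row_scale) @ (alpha * W) @ diag(col_scale). Weight decay is
# assumed to act on W only, leaving the multipliers free to learn the scale.
import torch
import torch.nn as nn

class MultiplierLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) / in_features**0.5)
        self.alpha = nn.Parameter(torch.ones(()))                 # learnable scalar multiplier
        self.row_scale = nn.Parameter(torch.ones(out_features))   # per-row (output) multipliers
        self.col_scale = nn.Parameter(torch.ones(in_features))    # per-column (input) multipliers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight: row_scale ⊙ (alpha * W) ⊙ col_scale, applied as x -> x @ W_eff^T
        w_eff = self.alpha * self.weight * self.row_scale[:, None] * self.col_scale[None, :]
        return x @ w_eff.t()

# Usage sketch: apply weight decay to W only, so the multipliers (not the
# WD-noise equilibrium) determine the layer's scale.
layer = MultiplierLinear(512, 2048)
decay = [layer.weight]
no_decay = [layer.alpha, layer.row_scale, layer.col_scale]
optimizer = torch.optim.AdamW([
    {"params": decay, "weight_decay": 0.1},
    {"params": no_decay, "weight_decay": 0.0},
], lr=3e-4)
```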