Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
January 8, 2026
Authors: Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid
cs.AI
Abstract
Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure and address it by introducing learnable multipliers that learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to the data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scales by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width scaling of the learned multipliers. Finally, we validate learnable multipliers with both the Adam and Muon optimizers, where they yield improvements in downstream evaluations comparable to the gain obtained by switching from Adam to Muon.
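To make the idea concrete, the following is a minimal PyTorch-style sketch (not the authors' implementation) of a matrix layer with a learnable scalar multiplier plus learnable per-row and per-column multipliers; the class name `MultiplierLinear`, the initialization, and the choice to keep WD only on `weight` are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiplierLinear(nn.Module):
    """Linear layer whose effective weight is scalar * diag(row) @ W @ diag(col).

    Hypothetical sketch: W stays under weight decay as usual, while the
    multipliers are meant to be excluded from weight decay so the overall
    scale (and per-row / per-column scales) can adapt to the data instead
    of being pinned at the WD-noise equilibrium norm.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=in_features ** -0.5)
        # Learnable multipliers, initialized to 1 so training starts from the plain layer.
        self.scalar = nn.Parameter(torch.ones(()))               # global scale of W
        self.row_scale = nn.Parameter(torch.ones(out_features))  # one multiplier per output row
        self.col_scale = nn.Parameter(torch.ones(in_features))   # one multiplier per input column

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight: scalar * diag(row_scale) @ W @ diag(col_scale)
        scaled_w = self.scalar * self.row_scale[:, None] * self.weight * self.col_scale[None, :]
        return F.linear(x, scaled_w)
```

In practice one would place `weight` and the multiplier parameters in separate optimizer parameter groups (e.g. weight decay on `weight` only), and note that the scalar, row, and column multipliers are jointly over-parameterized, which is one instance of the forward-pass symmetries mentioned above.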