
Robust Layerwise Scaling Rules by Proper Weight Decay Tuning

October 17, 2025
Authors: Zhiyuan Fan, Yifeng Liu, Qingyue Zhao, Angela Yuan, Quanquan Gu
cs.AI

Abstract

Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization (muP) enables learning-rate transfer across widths by equalizing early-time update magnitudes. However, in modern scale-invariant architectures, training quickly enters an optimizer-governed steady state where normalization layers create backward scale sensitivity and the effective learning rate becomes width dependent, degrading muP transfer. We address this by introducing a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Empirically, the singular-value spectrum of each matrix parameter scales in norm as $\eta/\lambda$ with an approximately invariant shape; under width scaling $d$, we observe that the top singular value scales approximately as $\eta/\lambda \cdot d^{0.75}$. Combining this observation with the muP learning-rate rule $\eta_2 \propto d^{-1}$ for matrix-like parameters implies an empirical weight-decay scaling rule $\lambda_2 \propto d$ that approximately keeps sublayer gains width invariant. Together with vector-like parameters trained at $\eta_1 = \Theta_d(1)$ and $\lambda_1 = 0$, this yields zero-shot transfer of both learning rate and weight decay from proxy to target widths, removing per-width sweeps. We validate the rule on LLaMA-style Transformers and in a minimal synthetic setting, and we provide a simple diagnostic, matching top singular values, to check sublayer-gain invariance. Our results extend muP beyond the near-init regime by explicitly controlling steady-state scales set by the optimizer, offering a practical recipe for width-robust hyperparameter transfer under AdamW.
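
Below is a minimal sketch, not the authors' code, of the zero-shot transfer recipe the abstract describes: matrix-like parameters follow the muP learning-rate rule $\eta_2 \propto d^{-1}$ together with the proposed weight-decay rule $\lambda_2 \propto d$, while vector-like parameters keep a width-independent learning rate and zero weight decay. The function and argument names (`transfer_hparams`, `proxy_width`, `target_width`) are illustrative assumptions, not from the paper.

```python
# Sketch of width-robust AdamW hyperparameter transfer under the rules stated
# in the abstract. All names and the example values are hypothetical.

def transfer_hparams(proxy_width: int, target_width: int,
                     eta2_proxy: float, lambda2_proxy: float,
                     eta1_proxy: float) -> dict:
    """Zero-shot transfer of AdamW hyperparameters from a proxy to a target width."""
    ratio = target_width / proxy_width
    return {
        # Matrix-like parameters: eta_2 ∝ d^{-1} (muP) and lambda_2 ∝ d,
        # which approximately keeps sublayer gains width invariant.
        "matrix": {"lr": eta2_proxy / ratio, "weight_decay": lambda2_proxy * ratio},
        # Vector-like parameters: eta_1 = Theta_d(1) and lambda_1 = 0.
        "vector": {"lr": eta1_proxy, "weight_decay": 0.0},
    }


if __name__ == "__main__":
    # Example: hyperparameters tuned at a proxy width of 1024, reused at width 4096.
    print(transfer_hparams(1024, 4096, eta2_proxy=3e-3, lambda2_proxy=0.1, eta1_proxy=3e-3))
```

The abstract also mentions a diagnostic based on matching top singular values. The sketch below, assuming PyTorch models whose matrix parameters share names across widths, shows one plausible way to compare top singular values of corresponding weight matrices; the helper names are again assumptions for illustration.

```python
# Sketch of a top-singular-value comparison across a proxy-width and a
# target-width model, as a rough check of sublayer-gain invariance.
import torch


def top_singular_value(weight: torch.Tensor) -> float:
    """Largest singular value of a 2-D weight matrix."""
    return torch.linalg.svdvals(weight.detach().float()).max().item()


def compare_top_singular_values(proxy_model: torch.nn.Module,
                                target_model: torch.nn.Module) -> None:
    """Print top singular values for matrix parameters present in both models."""
    target_params = dict(target_model.named_parameters())
    for name, p in proxy_model.named_parameters():
        if p.ndim == 2 and name in target_params:
            s_proxy = top_singular_value(p)
            s_target = top_singular_value(target_params[name])
            print(f"{name}: proxy top sigma = {s_proxy:.3f}, "
                  f"target top sigma = {s_target:.3f}")
```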