Robust Layerwise Scaling Rules by Proper Weight Decay Tuning

Abstract

Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization (muP) enables learning-rate transfer across widths by equalizing early-time update magnitudes. However, in modern scale-invariant architectures, training quickly enters an optimizer-governed steady state where normalization layers create backward scale sensitivity and the effective learning rate becomes width dependent, degrading muP transfer. We address this by introducing a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Empirically, the singular-value spectrum of each matrix parameter scales in norm as $\sqrt{\eta/\lambda}$ with an approximately invariant shape; under width scaling $d$, we observe that the top singular value scales approximately as $\sqrt{\eta/\lambda}\cdot d^{0.75}$. Combining this observation with the muP learning-rate rule $\eta_2\propto d^{-1}$ for matrix-like parameters implies an empirical weight-decay scaling rule $\lambda_2\propto\sqrt{d}$ that approximately keeps sublayer gains width invariant. Together with vector-like parameters trained at $\eta_1=\Theta_d(1)$ and $\lambda_1=0$, this yields zero-shot transfer of both learning rate and weight decay from proxy to target widths, removing per-width sweeps. We validate the rule on LLaMA-style Transformers and in a minimal synthetic setting, and we provide a simple diagnostic, matching top singular values, to check sublayer-gain invariance. Our results extend muP beyond the near-init regime by explicitly controlling steady-state scales set by the optimizer, offering a practical recipe for width-robust hyperparameter transfer under AdamW.
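The transfer recipe for matrix-like parameters can be illustrated with a minimal sketch. It assumes the square-root reading of the empirical model, in which the top singular value behaves like c * sqrt(eta/lambda) * d^0.75; under that reading, eta_2 ∝ d^{-1} and lambda_2 ∝ sqrt(d) cancel the width dependence exactly. The function names, the constant `c`, and the example widths and hyperparameter values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def transfer_matrix_hparams(eta_proxy, lam_proxy, d_proxy, d_target):
    """Scale matrix-like AdamW hyperparameters from a proxy to a target width.

    Assumed rules: eta_2 ∝ d^{-1} (muP) and lambda_2 ∝ sqrt(d).
    Vector-like parameters are not handled here; per the abstract they keep
    a width-independent learning rate and zero weight decay.
    """
    ratio = d_target / d_proxy
    eta_target = eta_proxy / ratio              # learning rate shrinks like 1/d
    lam_target = lam_proxy * math.sqrt(ratio)   # weight decay grows like sqrt(d)
    return eta_target, lam_target


def predicted_top_singular_value(eta, lam, d, c=1.0):
    """Assumed empirical model: sigma_max ≈ c * sqrt(eta / lam) * d**0.75."""
    return c * math.sqrt(eta / lam) * d ** 0.75


# Hypothetical proxy configuration, transferred to a 4x wider model.
eta_p, lam_p, d_p, d_t = 1e-2, 0.1, 256, 1024
eta_t, lam_t = transfer_matrix_hparams(eta_p, lam_p, d_p, d_t)

gain_proxy = predicted_top_singular_value(eta_p, lam_p, d_p)
gain_target = predicted_top_singular_value(eta_t, lam_t, d_t)
print(eta_t, lam_t, gain_proxy, gain_target)
```

Under these assumptions the two predicted top singular values agree, which is the model-level counterpart of the diagnostic mentioned in the abstract. In an actual training run the comparison would be made on measured top singular values of the trained weight matrices rather than on this closed-form model.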
