Negligible in Size, Pronounced in Effect: On Scale Vectors in Large Language Models

Abstract

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.
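
The statement that scale vectors do not increase expressivity in Pre-Norm architectures can be illustrated with a small numerical sketch. It assumes RMS normalization as the deterministic normalization operation (the abstract does not name one) and a single bias-free linear map after it; the function names `rms_normalize`, `with_scale`, and `folded` are hypothetical and used only for illustration. Under these assumptions, a scale vector gamma applied before a weight matrix W computes the same function as the rescaled matrix W diag(gamma) with no scale vector at all.

```python
import math
import random


def rms_normalize(x, eps=1e-6):
    """Parameter-free RMS normalization of a vector (assumed normalization op)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def linear(W, x):
    """Bias-free matrix-vector product; W is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def with_scale(W, gamma, x):
    """Normalize, multiply elementwise by gamma, then apply W."""
    x_hat = rms_normalize(x)
    return linear(W, [g * v for g, v in zip(gamma, x_hat)])


def folded(W, gamma, x):
    """Same map with gamma absorbed into the columns of W, i.e. W diag(gamma)."""
    W_folded = [[w * g for w, g in zip(row, gamma)] for row in W]
    return linear(W_folded, rms_normalize(x))


rng = random.Random(0)
d_in, d_out = 8, 4
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
gamma = [rng.uniform(0.5, 2.0) for _ in range(d_in)]
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]

# The two parameterizations agree up to floating-point rounding.
max_gap = max(
    abs(a - b) for a, b in zip(with_scale(W, gamma, x), folded(W, gamma, x))
)
print(f"largest output difference: {max_gap:.2e}")
```

The two forms describe the same set of functions, but they are different parameterizations: the gradient with respect to W in the scaled form is itself rescaled by gamma. This is consistent with the abstract's claim that the benefit comes from optimization rather than expressivity, although the sketch does not reproduce the self-amplifying preconditioning analysis itself.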