Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

Abstract

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.
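The expressivity claim can be illustrated with a minimal NumPy sketch. It assumes an RMSNorm-style normalization whose output feeds directly into a linear map; the abstract does not specify the normalization, and the function names below are illustrative, not the authors' code. Under that assumption, the scale vector can be folded into the next weight matrix without changing the computed function, which is one way to see why it adds no expressivity. The sketch does not demonstrate the optimization effect.

```python
import numpy as np

def rms_normalize(x, eps=1e-6):
    """Deterministic part of an RMSNorm-style layer (assumed here):
    rescale each row to approximately unit root-mean-square."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def scaled_then_linear(x, gamma, W):
    """Normalize, multiply by the learnable scale vector, then apply a linear map."""
    return (rms_normalize(x) * gamma) @ W

def folded_linear(x, gamma, W):
    """Same function with the scale vector folded into the weight matrix."""
    W_folded = gamma[:, None] * W  # equivalent to diag(gamma) @ W
    return rms_normalize(x) @ W_folded

# Hypothetical shapes chosen only for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # batch of 4 hidden states, width 8
gamma = rng.normal(size=8)       # scale vector
W = rng.normal(size=(8, 16))     # subsequent linear mapping

out_scaled = scaled_then_linear(x, gamma, W)
out_folded = folded_linear(x, gamma, W)
max_gap = np.max(np.abs(out_scaled - out_folded))
```

The two outputs agree up to floating-point rounding, so the two parameterizations represent the same set of functions; any difference between them in training comes from how gradient updates act on `gamma` and `W`, not from what they can express.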