Small in Size, Large in Effect: On Scale Vectors in Large Language Models

Abstract

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.
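The claim that scale vectors add no expressivity in Pre-Norm architectures can be illustrated with a standard identity: when a scale vector is followed directly by a bias-free linear map, it can be absorbed into that map's columns, since W(γ ⊙ n(x)) = (W diag(γ)) n(x). The sketch below checks this numerically. It is a minimal illustration, not the paper's analysis: the choice of RMS normalization, the epsilon value, and every function name are assumptions introduced here.

```python
import math
import random


def rms_normalize(x, eps=1e-6):
    """Deterministic part of the layer (assumed here to be RMS normalization)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def norm_then_linear(x, gamma, W):
    """Compute W (gamma * n(x)): normalize, apply the scale vector, then the linear map."""
    n = rms_normalize(x)
    return matvec(W, [g * v for g, v in zip(gamma, n)])


def fold_scale(W, gamma):
    """Absorb the scale vector into the columns of W, giving W diag(gamma)."""
    return [[w * g for w, g in zip(row, gamma)] for row in W]


def max_fold_error(d_in=8, d_out=4, seed=0):
    """Largest absolute gap between the scaled layer and its folded equivalent."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(d_in)]
    gamma = [rng.uniform(0.5, 1.5) for _ in range(d_in)]
    W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
    with_scale = norm_then_linear(x, gamma, W)
    folded = matvec(fold_scale(W, gamma), rms_normalize(x))
    return max(abs(a - b) for a, b in zip(with_scale, folded))


if __name__ == "__main__":
    print(max_fold_error())
```

The two parameterizations compute the same function up to floating-point rounding, so any benefit from keeping the scale vector separate has to come from how training proceeds rather than from what the model can represent, which is consistent with the abstract's optimization-based explanation.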