Variance Control via Weight Rescaling in LLM Pre-training
March 21, 2025
Authors: Louis Owen, Abhay Kumar, Nilabhra Roy Chowdhury, Fabian Güra
cs.AI
Abstract
The outcome of Large Language Model (LLM) pre-training strongly depends on
weight initialization and variance control strategies. Although the importance
of initial variance control has been well documented in neural networks in
general, the literature on initialization and management of its growth during
LLM pre-training, specifically, is somewhat sparse. In this paper, we introduce
the Layer Index Rescaling (LIR) weight initialization scheme, and the Target
Variance Rescaling (TVR) variance control strategy. Experiments on a 1B
parameter LLaMA model demonstrate that better variance management using these
techniques yields substantial improvements in downstream task performance (up
to 4.6% on common pre-training benchmarks) and reduces extreme activation
values, thus mitigating challenges associated with quantization and
low-precision training. Our code is available at:
https://github.com/bluorion-com/weight_rescaling.
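To make the two named ideas concrete, below is a minimal PyTorch sketch of one plausible reading of them, assuming Layer Index Rescaling shrinks the per-layer initialization standard deviation as a function of layer depth and Target Variance Rescaling periodically renormalizes a weight matrix toward a target standard deviation during training. The function names, the 1/sqrt(layer_index) factor, and all constants are illustrative assumptions, not the paper's exact formulation; see the linked repository for the authors' implementation.

```python
import math
import torch
import torch.nn as nn

def lir_init_std(base_std: float, layer_index: int) -> float:
    """Assumed form of Layer Index Rescaling: reduce the initialization
    standard deviation for deeper layers so that residual-stream variance
    does not compound with depth."""
    return base_std / math.sqrt(max(layer_index, 1))

@torch.no_grad()
def target_variance_rescale(weight: torch.Tensor, target_std: float) -> None:
    """Assumed form of Target Variance Rescaling: rescale a weight matrix
    so its empirical standard deviation matches a target value, limiting
    variance drift (and extreme activations) during pre-training."""
    current_std = weight.std()
    if current_std > 0:
        weight.mul_(target_std / current_std)

# Example: initialize a projection in layer 12 of a decoder stack, then
# pull it back toward the target std (e.g. every N optimizer steps).
proj = nn.Linear(2048, 2048, bias=False)
std_12 = lir_init_std(base_std=0.02, layer_index=12)
nn.init.normal_(proj.weight, mean=0.0, std=std_12)
target_variance_rescale(proj.weight, target_std=std_12)
```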