Variance Control via Weight Rescaling in LLM Pre-training
March 21, 2025
Authors: Louis Owen, Abhay Kumar, Nilabhra Roy Chowdhury, Fabian Güra
cs.AI
Abstract
The outcome of Large Language Model (LLM) pre-training strongly depends on
weight initialization and variance control strategies. Although the importance
of initial variance control has been well documented in neural networks in
general, the literature on initialization and management of its growth during
LLM pre-training, specifically, is somewhat sparse. In this paper, we introduce
the Layer Index Rescaling (LIR) weight initialization scheme, and the Target
Variance Rescaling (TVR) variance control strategy. Experiments on a 1B
parameter LLaMA model demonstrate that better variance management using these
techniques yields substantial improvements in downstream task performance (up
to 4.6% on common pre-training benchmarks) and reduces extreme activation
values, thus mitigating challenges associated with quantization and
low-precision training. Our code is available at:
https://github.com/bluorion-com/weight_rescaling.
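To make the two named ideas concrete, below is a minimal PyTorch sketch of one plausible reading of them, assuming Layer Index Rescaling shrinks the per-layer initialization standard deviation as a function of layer depth and Target Variance Rescaling periodically renormalizes a weight matrix toward a target standard deviation during training. The function names, the 1/sqrt(layer_index) factor, and all constants are illustrative assumptions, not the paper's exact formulation; see the linked repository for the authors' implementation.

```python
import math
import torch
import torch.nn as nn

def lir_init_std(base_std: float, layer_index: int) -> float:
    """Assumed form of Layer Index Rescaling: reduce the initialization
    standard deviation for deeper layers so that residual-stream variance
    does not compound with depth."""
    return base_std / math.sqrt(max(layer_index, 1))

@torch.no_grad()
def target_variance_rescale(weight: torch.Tensor, target_std: float) -> None:
    """Assumed form of Target Variance Rescaling: rescale a weight matrix
    so its empirical standard deviation matches a target value, limiting
    variance drift (and extreme activations) during pre-training."""
    current_std = weight.std()
    if current_std > 0:
        weight.mul_(target_std / current_std)

# Example: initialize a projection in layer 12 of a decoder stack, then
# pull it back toward the target std (e.g. every N optimizer steps).
proj = nn.Linear(2048, 2048, bias=False)
std_12 = lir_init_std(base_std=0.02, layer_index=12)
nn.init.normal_(proj.weight, mean=0.0, std=std_12)
target_variance_rescale(proj.weight, target_std=std_12)
```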