Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models
February 21, 2025
Authors: Ya Wang, Zhijian Zhuo, Yutao Zeng, Xun Zhou, Jian Yang, Xiaoqing Li
cs.AI
Abstract
Training stability is a persistent challenge in the pre-training of large
language models (LLMs), particularly for architectures such as Post-Norm
Transformers, which are prone to gradient explosion and dissipation. In this
paper, we propose Scale-Distribution Decoupling (SDD), a novel approach that
stabilizes training by explicitly decoupling the scale and distribution of the
weight matrix in fully-connected layers. SDD applies a normalization mechanism
to regulate activations and a learnable scaling vector to maintain
well-conditioned gradients, effectively preventing gradient explosion
and dissipation. This separation improves optimization efficiency,
particularly in deep networks, by ensuring stable gradient propagation.
Experimental results demonstrate that our method stabilizes training across
various LLM architectures and outperforms existing techniques in different
normalization configurations. Furthermore, the proposed method is lightweight
and compatible with existing frameworks, making it a practical solution for
stabilizing LLM training. Code is available at https://github.com/kaihemo/SDD.
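
To make the idea concrete, the following is a minimal PyTorch sketch of what a scale-distribution-decoupled fully-connected layer could look like, based only on the description in this abstract. The class name SDDLinear, the RMS-style normalization, the initialization, and the epsilon value are illustrative assumptions rather than the authors' exact formulation; the official implementation is in the linked repository.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SDDLinear(nn.Module):
    """Illustrative fully-connected layer that decouples the scale and
    distribution of its output (an assumption-based sketch, not the
    official SDD implementation)."""

    def __init__(self, in_features: int, out_features: int, eps: float = 1e-6):
        super().__init__()
        # Weight matrix whose *distribution* shapes the pre-activation.
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=in_features ** -0.5)
        # Learnable per-channel *scale* vector, kept as a separate parameter
        # so magnitude can be adjusted without disturbing the distribution.
        self.scale = nn.Parameter(torch.ones(out_features))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Distribution path: plain linear projection ...
        h = F.linear(x, self.weight)
        # ... followed by an RMS-style normalization that fixes the
        # activation statistics regardless of the weight magnitude.
        h = h * torch.rsqrt(h.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        # Scale path: re-introduce magnitude through the learnable vector.
        return h * self.scale


if __name__ == "__main__":
    layer = SDDLinear(512, 1024)
    x = torch.randn(2, 16, 512)
    print(layer(x).shape)  # torch.Size([2, 16, 1024])

The point of the sketch is that the weight matrix (which shapes the activation distribution) and the scale vector (which sets the activation magnitude) are optimized as separate parameters, so updates to the weights cannot directly blow up or shrink the layer's output scale.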