Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models
February 21, 2025
Authors: Ya Wang, Zhijian Zhuo, Yutao Zeng, Xun Zhou, Jian Yang, Xiaoqing Li
cs.AI
Abstract
Training stability is a persistent challenge in the pre-training of large
language models (LLMs), particularly for architectures such as Post-Norm
Transformers, which are prone to gradient explosion and dissipation. In this
paper, we propose Scale-Distribution Decoupling (SDD), a novel approach that
stabilizes training by explicitly decoupling the scale and distribution of the
weight matrix in fully-connected layers. SDD applies a normalization mechanism
to regulate activations and a learnable scaling vector to maintain
well-conditioned gradients, effectively preventing gradient explosion
and dissipation. This separation improves optimization efficiency,
particularly in deep networks, by ensuring stable gradient propagation.
Experimental results demonstrate that our method stabilizes training across
various LLM architectures and outperforms existing techniques in different
normalization configurations. Furthermore, the proposed method is lightweight
and compatible with existing frameworks, making it a practical solution for
stabilizing LLM training. Code is available at https://github.com/kaihemo/SDD.
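
To make the idea concrete, the following is a minimal PyTorch sketch of what a scale-distribution-decoupled fully-connected layer could look like, based only on the description in this abstract. The class name SDDLinear, the RMS-style normalization, the initialization, and the epsilon value are illustrative assumptions rather than the authors' exact formulation; the official implementation is in the linked repository.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SDDLinear(nn.Module):
    """Illustrative fully-connected layer that decouples the scale and
    distribution of its output (an assumption-based sketch, not the
    official SDD implementation)."""

    def __init__(self, in_features: int, out_features: int, eps: float = 1e-6):
        super().__init__()
        # Weight matrix whose *distribution* shapes the pre-activation.
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=in_features ** -0.5)
        # Learnable per-channel *scale* vector, kept as a separate parameter
        # so magnitude can be adjusted without disturbing the distribution.
        self.scale = nn.Parameter(torch.ones(out_features))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Distribution path: plain linear projection ...
        h = F.linear(x, self.weight)
        # ... followed by an RMS-style normalization that fixes the
        # activation statistics regardless of the weight magnitude.
        h = h * torch.rsqrt(h.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        # Scale path: re-introduce magnitude through the learnable vector.
        return h * self.scale


if __name__ == "__main__":
    layer = SDDLinear(512, 1024)
    x = torch.randn(2, 16, 512)
    print(layer(x).shape)  # torch.Size([2, 16, 1024])

The point of the sketch is that the weight matrix (which shapes the activation distribution) and the scale vector (which sets the activation magnitude) are optimized as separate parameters, so updates to the weights cannot directly blow up or shrink the layer's output scale.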