Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models

Abstract

Training stability is a persistent challenge in the pre-training of large language models (LLMs), particularly for architectures such as Post-Norm Transformers, which are prone to exploding and vanishing gradients. In this paper, we propose Scale-Distribution Decoupling (SDD), a novel approach that stabilizes training by explicitly decoupling the scale and the distribution of the weight matrix in fully-connected layers. SDD applies a normalization mechanism to regulate activations and a learnable scaling vector to maintain well-conditioned gradients, effectively preventing exploding and vanishing gradients. This separation improves optimization efficiency, particularly in deep networks, by ensuring stable gradient propagation. Experimental results demonstrate that our method stabilizes training across various LLM architectures and outperforms existing techniques across different normalization configurations. Furthermore, the proposed method is lightweight and compatible with existing frameworks, making it a practical solution for stabilizing LLM training. Code is available at https://github.com/kaihemo/SDD.
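
The abstract describes the mechanism only in words, so the following is a minimal sketch of what a decoupled fully-connected layer could look like, not the authors' implementation. It assumes the form y = alpha * norm(W x), with RMS normalization standing in for the "normalization mechanism"; the function name `sdd_linear`, the parameter names `alpha` and `eps`, and the choice of RMS normalization are all assumptions made for illustration.

```python
import math

def sdd_linear(x, weight, alpha, eps=1e-6):
    """Hypothetical SDD-style layer (assumed form): y = alpha * rms_norm(W x)."""
    # Plain matrix-vector product; weight is a list of rows.
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weight]
    # RMS-normalize the pre-activation so its scale no longer depends on
    # the magnitude of the weight matrix (assumed normalization choice).
    rms = math.sqrt(sum(v * v for v in z) / len(z) + eps)
    z_hat = [v / rms for v in z]
    # The learnable per-feature vector alpha alone sets the output scale.
    return [a * v for a, v in zip(alpha, z_hat)]

x = [1.0, -2.0, 0.5]
weight = [
    [0.2, -0.1, 0.4],
    [0.3, 0.5, -0.2],
    [-0.6, 0.1, 0.7],
    [0.05, 0.9, -0.3],
]
alpha = [1.0, 1.0, 1.0, 1.0]

# Rescaling the weights by 100x leaves the output essentially unchanged:
# the weights only determine the distribution, alpha determines the scale.
y_small = sdd_linear(x, weight, alpha)
y_large = sdd_linear(x, [[100.0 * w for w in row] for row in weight], alpha)
```

Under these assumptions, the magnitude of `weight` has no effect on the magnitude of the output, which is one way to read the claim that scale and distribution are decoupled. The released code at the URL above is the authoritative reference for the actual formulation.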
