GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

Abstract

Modern Large Language Models, such as the LLaMA, Qwen, and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. Although Pre-LN is stable during pretraining and scales to large model sizes, it suffers from exponential growth in activation variance across layers. This growth causes the residual path to dominate the sub-layer outputs and limits the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS scales down the intermediate activations while keeping their gradients unchanged. This leaves the information in the activations intact and avoids the vanishing-gradient problem associated with gradient downscaling. Extensive experiments across model sizes from 71M to 1B parameters show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and its potential to improve training dynamics in a wide range of settings.
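
The core idea, scaling activations in the forward pass while leaving gradients untouched in the backward pass, can be illustrated with a toy sketch. This is not the authors' implementation: the fixed scale `s`, the depth, and all function names are hypothetical choices for illustration, and the layers are reduced to the scaling operation alone. In an autodiff framework the same effect is typically obtained with a stop-gradient operator (for example `detach` in PyTorch or `jax.lax.stop_gradient` in JAX).

```python
# Toy comparison of ordinary scaling vs. gradient-preserving scaling.
# Assumptions (not from the abstract): a fixed scale s, and a stack of
# layers that consist only of the scaling op, with manual backward passes.

def scale_forward(x, s):
    """Forward pass shared by both variants: y = s * x."""
    return [s * v for v in x]

def plain_backward(grad_y, s):
    """Ordinary scaling: dy/dx = s, so the gradient is scaled down as well."""
    return [s * g for g in grad_y]

def gradient_preserving_backward(grad_y, s):
    """Gradient-preserving scaling: the backward pass acts as the identity."""
    return list(grad_y)

def backprop_through_stack(grad_out, s, depth, backward):
    """Propagate a gradient back through `depth` stacked scaling ops."""
    grad = grad_out
    for _ in range(depth):
        grad = backward(grad, s)
    return grad

x = [2.0, -4.0]
s = 0.5      # hypothetical scale factor
depth = 8    # hypothetical number of stacked layers

y = scale_forward(x, s)  # activations are scaled down in both variants
plain = backprop_through_stack([1.0, 1.0], s, depth, plain_backward)
kept = backprop_through_stack([1.0, 1.0], s, depth, gradient_preserving_backward)

print("scaled activations:", y)
print("gradient with ordinary scaling:", plain)
print("gradient with preserved gradients:", kept)
```

With ordinary scaling, the gradient shrinks by a factor of `s` at every layer and decays geometrically with depth. The gradient-preserving variant keeps the gradient magnitude while still reducing the activations.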
