GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

Abstract

Modern large language models, such as the LLaMA, Qwen, and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. Although Pre-LN is stable during pretraining and scales to large model sizes, it suffers from exponential growth in activation variance across layers. This growth causes the residual path to dominate the sub-layer outputs and limits the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS scales down the intermediate activations while keeping their gradients unchanged. This leaves the information in the activations intact and avoids the vanishing-gradient problem associated with gradient downscaling. Extensive experiments across model sizes from 71M to 1B parameters show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and its potential to improve training dynamics in a wide range of settings.
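The contrast between ordinary scaling and gradient-preserving scaling can be illustrated with a toy sketch. It is not the paper's implementation: the forward and backward rules are written by hand for a bare scaling operation, and the function names, the factor 0.5, and the depth 8 are all assumptions chosen for illustration. In an autodiff framework, the same effect would typically come from a stop-gradient operator such as `detach()` in PyTorch or `jax.lax.stop_gradient` in JAX.

```python
# Toy sketch only: hand-written forward/backward rules for a scaling op.
# All names and constants here are hypothetical, not taken from the paper.

def scale_forward(x, s):
    """Forward pass shared by both variants: shrink every activation by s."""
    return [s * v for v in x]

def plain_backward(grad_out, s):
    """Ordinary scaling: the chain rule multiplies the gradient by s too."""
    return [s * g for g in grad_out]

def preserving_backward(grad_out, s):
    """Gradient-preserving scaling: the backward pass is the identity."""
    return list(grad_out)

def backprop_through(depth, s, backward):
    """Push a unit gradient back through `depth` stacked scaling ops."""
    grad = [1.0]
    for _ in range(depth):
        grad = backward(grad, s)
    return grad[0]

s, depth = 0.5, 8  # arbitrary illustrative values
activations = scale_forward([4.0, -2.0], s)  # both variants agree here
plain_grad = backprop_through(depth, s, plain_backward)
preserved_grad = backprop_through(depth, s, preserving_backward)

print("scaled activations:", activations)
print("gradient after plain scaling:", plain_grad)
print("gradient after preserving scaling:", preserved_grad)
```

In this toy setting both variants produce the same scaled activations, but plain scaling shrinks the backpropagated gradient by a factor of `s` at every layer, whereas the gradient-preserving variant passes it through unchanged.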
