Progressive Residual Warmup for Language Model Pretraining

Abstract

Transformer architectures are the backbone of most modern large language models (LLMs), so their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency between sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layers learn first" philosophy by multiplying each layer's residual by a scalar that gradually warms up from 0 to 1, with deeper layers given more warmup steps. In this way, deeper layers wait for earlier layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across a range of model scales, normalization schemes, and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also introduces a unique optimization trajectory, leading to faster convergence, stronger generalization, and better downstream performance. Our code is available at https://github.com/dandingsky/ProRes.
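To make the mechanism concrete, the following minimal Python sketch shows one way a per-layer residual warmup could look. It is an illustration under stated assumptions, not the released implementation: the linear ramp, the rule that layer l warms up over `base_warmup_steps * (l + 1)` steps, the placement of the scalar on the residual branch output, and the names `prores_scale`, `forward_with_prores`, and `base_warmup_steps` are all hypothetical choices. The abstract specifies only that the scalar rises from 0 to 1 and that deeper layers take more warmup steps.

```python
from typing import Callable, Sequence


def prores_scale(step: int, layer_idx: int, base_warmup_steps: int = 1000) -> float:
    """Residual scale for one layer at a given training step.

    Assumption (not from the paper): a linear ramp from 0 to 1, where
    0-indexed layer `layer_idx` warms up over
    base_warmup_steps * (layer_idx + 1) steps, so deeper layers ramp slower.
    """
    warmup_steps = base_warmup_steps * (layer_idx + 1)
    return min(1.0, max(0.0, step / warmup_steps))


def forward_with_prores(
    x: float,
    layers: Sequence[Callable[[float], float]],
    step: int,
    base_warmup_steps: int = 1000,
) -> float:
    """Apply stacked residual layers, scaling each residual branch.

    Each layer contributes x + alpha * layer(x), where alpha is that layer's
    warmup scale. With alpha == 0 the layer acts as the identity.
    Scalars stand in for tensors to keep the sketch dependency-free.
    """
    for layer_idx, layer in enumerate(layers):
        alpha = prores_scale(step, layer_idx, base_warmup_steps)
        x = x + alpha * layer(x)
    return x
```

Under these assumptions, every layer is an identity map at step 0, the first layer reaches full strength after `base_warmup_steps` steps, and each deeper layer reaches it later, which mirrors the "early layers learn first" ordering described above.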
