The Curse of Depth in Large Language Models

Abstract

In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation that nearly half of the layers in modern Large Language Models (LLMs) are less effective than expected. We first confirm that this phenomenon is widespread across the most popular LLM families, such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, both theoretical and empirical, identifies the widespread use of Pre-Layer Normalization (Pre-LN) as the underlying reason for the ineffectiveness of deep layers in LLMs. While Pre-LN stabilizes the training of Transformer LLMs, its output variance grows exponentially with model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, so that these blocks barely contribute to training. To resolve this training pitfall, we propose LayerNorm Scaling, which scales the variance of the layer normalization output inversely by the square root of its depth. This simple modification mitigates the output variance explosion in deeper Transformer layers and improves their contribution. Our experimental results, spanning model sizes from 130M to 1B, demonstrate that LayerNorm Scaling significantly enhances LLM pre-training performance compared to Pre-LN. Moreover, this improvement carries over seamlessly to supervised fine-tuning. All of these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training.
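As a rough illustration of the scaling idea described in the abstract, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes that the layer normalization output at depth l (counted from 1) is multiplied by 1/sqrt(l), which shrinks the output variance by a factor of 1/l. If the abstract's wording is read literally, with the variance itself scaled by 1/sqrt(l), the multiplier would instead be l^(-1/4). The function names `layer_norm`, `layer_norm_scaling`, and `variance` are hypothetical, and the learned gain and bias of a real layer normalization are omitted.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance.

    Simplified: no learned gain or bias parameters.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def layer_norm_scaling(x, layer_index, eps=1e-5):
    """Layer normalization followed by depth-dependent scaling.

    Assumption: the normalized output is multiplied by 1/sqrt(layer_index),
    where layer_index starts at 1 for the first layer.
    """
    if layer_index < 1:
        raise ValueError("layer_index must be >= 1")
    scale = 1.0 / math.sqrt(layer_index)
    return [scale * v for v in layer_norm(x, eps)]


def variance(x):
    """Population variance of a vector, used only to inspect the effect."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)
```

Under these assumptions, calling `layer_norm_scaling(x, 4)` yields an output whose variance is one quarter of that of `layer_norm(x)`, while the first layer (`layer_index=1`) is left unchanged, so the damping only grows stronger for deeper layers.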
