The Curse of Depth in Large Language Models
February 9, 2025
Authors: Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu
cs.AI
Abstract
In this paper, we introduce the Curse of Depth, a concept that highlights,
explains, and addresses the recent observation in modern Large Language
Models (LLMs) where nearly half of the layers are less effective than expected.
We first confirm the wide existence of this phenomenon across the most popular
families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our theoretical and
empirical analysis identifies that the underlying reason for the
ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer
Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer
LLMs, its output variance grows exponentially with model depth, which
undesirably drives the derivative of the deep Transformer blocks toward an
identity matrix, so these blocks barely contribute to training. To resolve
this training pitfall, we propose LayerNorm Scaling, which scales the variance
of the output of layer normalization inversely by the square root of its depth.
This simple modification mitigates the output variance explosion of deeper
Transformer layers, improving their contribution. Our experimental results,
spanning model sizes from 130M to 1B, demonstrate that LayerNorm Scaling
significantly enhances LLM pre-training performance compared to Pre-LN.
Moreover, this improvement seamlessly carries over to supervised fine-tuning.
All these gains can be attributed to the fact that LayerNorm Scaling enables
deeper layers to contribute more effectively during training.
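The abstract describes LayerNorm Scaling only in prose. Below is a minimal, illustrative PyTorch sketch (not the authors' released code) of one natural reading: at layer l (1-indexed), the LayerNorm output is multiplied by 1/sqrt(l), which dampens the growth of output variance with depth under Pre-LN. The class name ScaledLayerNorm and the layer_index argument are assumptions made for illustration.

```python
# Minimal sketch of depth-dependent LayerNorm scaling, assuming the scaling is
# applied by multiplying the LayerNorm output at layer l by 1 / sqrt(l).
# ScaledLayerNorm and layer_index are illustrative names, not the paper's API.
import math
import torch
import torch.nn as nn


class ScaledLayerNorm(nn.Module):
    """LayerNorm whose output is scaled by 1/sqrt(layer_index) to curb variance growth with depth."""

    def __init__(self, hidden_size: int, layer_index: int, eps: float = 1e-5):
        super().__init__()
        assert layer_index >= 1, "layer_index is 1-based so the scale is well defined"
        self.norm = nn.LayerNorm(hidden_size, eps=eps)
        # Deeper layers get a smaller scale, counteracting variance explosion.
        self.scale = 1.0 / math.sqrt(layer_index)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x) * self.scale


# Usage sketch: layer l of a Pre-LN Transformer block would use
# ScaledLayerNorm(hidden_size, layer_index=l) in place of plain LayerNorm
# before its attention and MLP sublayers.
x = torch.randn(2, 16, 512)              # (batch, sequence, hidden)
ln = ScaledLayerNorm(512, layer_index=8)  # 8th layer: outputs scaled by 1/sqrt(8)
y = ln(x)
```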