Lost in Backpropagation: The LM Head is a Gradient Bottleneck

Abstract

The last layer of neural language models (LMs) projects output features of dimension D to logits of dimension V, the size of the vocabulary, where usually D ≪ V. This mismatch is known to raise the risk of limited expressivity in neural LMs, creating a so-called softmax bottleneck. We show that the softmax bottleneck is not only an expressivity bottleneck but also an optimization bottleneck. Backpropagating V-dimensional gradients through a rank-D linear layer induces unavoidable compression, which alters the training feedback provided to the vast majority of the parameters. We present a theoretical analysis of this phenomenon and measure empirically that 95-99% of the gradient norm is suppressed by the output layer, resulting in vastly suboptimal update directions. We conduct controlled pretraining experiments showing that the gradient bottleneck makes trivial patterns unlearnable and drastically affects the training dynamics of large language models (LLMs). We argue that this inherent flaw contributes to training inefficiencies at scale independently of the model architecture, and raises the need for new LM head designs.
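
To make the compression claim concrete, the following is a minimal NumPy sketch, not the authors' measurement protocol. It assumes a randomly initialized head W of shape (V, D), a random feature vector h, and a cross-entropy loss, all of which are illustrative choices. The gradient with respect to the logits is g = softmax(Wh) - one_hot(target), and the gradient passed to the rest of the network is W^T g, which depends only on the part of g lying in the column space of W. The helper names (logit_gradient, project_onto_colspace) are hypothetical, and the toy numbers are not expected to reproduce the 95-99% figure reported in the abstract.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector of logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def logit_gradient(W, h, target):
    # Gradient of the cross-entropy loss w.r.t. the logits: p - one_hot(target).
    g = softmax(W @ h)
    g[target] -= 1.0
    return g

def project_onto_colspace(W, g):
    # Orthogonal projection of g onto the column space of W (dimension <= D).
    Q, _ = np.linalg.qr(W)  # reduced QR: Q has orthonormal columns
    return Q @ (Q.T @ g)

rng = np.random.default_rng(0)
V, D = 1000, 16  # illustrative sizes with D << V (assumption)

# Bottlenecked head: only a small part of the logit gradient survives.
W = rng.normal(size=(V, D)) / np.sqrt(D)
h = rng.normal(size=D)
g = logit_gradient(W, h, target=3)
g_proj = project_onto_colspace(W, g)
frac_small = np.linalg.norm(g_proj) / np.linalg.norm(g)

# Reference case without a bottleneck (D = V): nothing is lost.
W_full = rng.normal(size=(V, V)) / np.sqrt(V)
h_full = rng.normal(size=V)
g_full = logit_gradient(W_full, h_full, target=3)
g_full_proj = project_onto_colspace(W_full, g_full)
frac_full = np.linalg.norm(g_full_proj) / np.linalg.norm(g_full)

print(f"retained norm fraction, D={D}: {frac_small:.3f}")
print(f"retained norm fraction, D={V}: {frac_full:.3f}")
```

In this sketch the component of g orthogonal to the column space of W contributes nothing to W^T g, so no parameter below the head receives that part of the training signal. Trained heads and real logit gradients may behave differently from this random setup.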
