
Lost in Backpropagation: The LM Head is a Gradient Bottleneck

March 10, 2026
作者: Nathan Godey, Yoav Artzi
cs.AI

Abstract

The last layer of neural language models (LMs) projects output features of dimension D to logits in dimension V, the size of the vocabulary, where usually D ≪ V. This mismatch is known to raise risks of limited expressivity in neural LMs, creating a so-called softmax bottleneck. We show the softmax bottleneck is not only an expressivity bottleneck but also an optimization bottleneck. Backpropagating V-dimensional gradients through a rank-D linear layer induces unavoidable compression, which alters the training feedback provided to the vast majority of the parameters. We present a theoretical analysis of this phenomenon and measure empirically that 95-99% of the gradient norm is suppressed by the output layer, resulting in vastly suboptimal update directions. We conduct controlled pretraining experiments showing that the gradient bottleneck makes trivial patterns unlearnable, and drastically affects the training dynamics of LLMs. We argue that this inherent flaw contributes to training inefficiencies at scale independently of the model architecture, and raises the need for new LM head designs.
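The compression effect the abstract describes can be illustrated with a toy NumPy sketch (not the paper's actual measurement; the dimensions D=64, V=4096 and the random-matrix setup are illustrative assumptions). Since the logit gradient reaches the rest of the network only through the transpose of the V×D head matrix W, any component of that gradient orthogonal to the column space of W is annihilated, and for a generic gradient roughly a 1 − D/V fraction of its squared norm is lost:

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 64, 4096  # illustrative feature and vocabulary dimensions

W = rng.standard_normal((V, D))  # toy LM head: logits z = W @ h
g = rng.standard_normal(V)       # toy gradient w.r.t. the logits, dL/dz

# dL/dh = W.T @ g, so only the projection of g onto col(W) propagates.
# Build an orthonormal basis for col(W) and project g onto it.
Q, _ = np.linalg.qr(W)            # V x D matrix with orthonormal columns
g_kept = Q @ (Q.T @ g)            # component of g that survives backprop

suppressed = 1.0 - np.linalg.norm(g_kept) ** 2 / np.linalg.norm(g) ** 2
print(f"fraction of squared gradient norm suppressed: {suppressed:.3f}")
```

For a random gradient the surviving fraction concentrates near D/V ≈ 0.016, i.e. roughly 98% of the squared norm is suppressed, consistent in spirit with the 95-99% range the abstract reports for trained models.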