
Lost in Backpropagation: The LM Head is a Gradient Bottleneck

March 10, 2026
作者: Nathan Godey, Yoav Artzi
cs.AI

Abstract

The last layer of neural language models (LMs) projects output features of dimension D to logits in dimension V, the size of the vocabulary, where usually D ≪ V. This mismatch is known to raise risks of limited expressivity in neural LMs, creating a so-called softmax bottleneck. We show the softmax bottleneck is not only an expressivity bottleneck but also an optimization bottleneck. Backpropagating V-dimensional gradients through a rank-D linear layer induces unavoidable compression, which alters the training feedback provided to the vast majority of the parameters. We present a theoretical analysis of this phenomenon and measure empirically that 95-99% of the gradient norm is suppressed by the output layer, resulting in vastly suboptimal update directions. We conduct controlled pretraining experiments showing that the gradient bottleneck makes trivial patterns unlearnable and drastically affects the training dynamics of LLMs. We argue that this inherent flaw contributes to training inefficiencies at scale independently of the model architecture, and raises the need for new LM head designs.
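The scale of the suppression described above can be illustrated with a small NumPy sketch (the sizes and the random-gradient assumption are hypothetical, not taken from the paper): since the hidden-state gradient is dL/dh = Wᵀ (dL/dlogits), only the component of the V-dimensional logit gradient lying in the column space of the rank-D head survives backpropagation, so a generic gradient loses roughly a 1 − D/V fraction of its squared norm.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V = 128, 5_000  # hidden size and vocabulary size (illustrative values)

# Random LM head mapping D-dim features to V-dim logits (logits = W @ h).
W = rng.standard_normal((V, D)) / np.sqrt(D)

# A random V-dimensional logit gradient, standing in for dL/dlogits.
g = rng.standard_normal(V)

# Only the component of g inside col(W) reaches the hidden state,
# since dL/dh = W.T @ g. Build an orthonormal basis of col(W) to isolate it.
Q, _ = np.linalg.qr(W)        # shape (V, D)
g_kept = Q @ (Q.T @ g)        # projection of g onto col(W)

suppressed = 1 - np.linalg.norm(g_kept) ** 2 / np.linalg.norm(g) ** 2
print(f"fraction of squared gradient norm suppressed: {suppressed:.3f}")
# For D ≪ V this is close to 1 - D/V (here about 0.97), in line with the
# 95-99% range the abstract reports for trained models.
```

A random gradient is a simplification; real logit gradients (softmax probabilities minus one-hot targets) are structured, which is why the measured suppression varies within the reported range rather than sitting exactly at 1 − D/V.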