Lost in Backpropagation: The LM Head is a Gradient Bottleneck

Abstract

The last layer of neural language models (LMs) projects output features of dimension D to logits in dimension V, the size of the vocabulary, where usually D ≪ V. This mismatch is known to raise risks of limited expressivity in neural LMs, creating a so-called softmax bottleneck. We show that the softmax bottleneck is not only an expressivity bottleneck but also an optimization bottleneck. Backpropagating V-dimensional gradients through a rank-D linear layer induces unavoidable compression, which alters the training feedback provided to the vast majority of the parameters. We present a theoretical analysis of this phenomenon and measure empirically that 95-99% of the gradient norm is suppressed by the output layer, resulting in vastly suboptimal update directions. We conduct controlled pretraining experiments showing that the gradient bottleneck makes trivial patterns unlearnable and drastically affects the training dynamics of LLMs. We argue that this inherent flaw contributes to training inefficiencies at scale independently of the model architecture, and raises the need for new LM head designs.
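The compression described in the abstract can be illustrated with a toy example. The sketch below is not the paper's measurement protocol: it assumes a random head matrix W of shape V x D, a random feature vector h, and a softmax cross-entropy loss, and all function names are hypothetical. It relies only on two standard facts: the loss gradient with respect to the logits is softmax(W h) minus the one-hot target, and the gradient passed back to h is W^T times that vector. Any part of the logit gradient orthogonal to the column space of W is therefore mapped to zero and never reaches the layers below the head.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def transpose_matvec(W, g):
    """Compute W^T g for W given as V rows of length D."""
    D = len(W[0])
    return [sum(row[d] * gv for row, gv in zip(W, g)) for d in range(D)]


def lm_head_backward(W, h, target):
    """Return (dL/dlogits, dL/dh) for softmax cross-entropy on logits = W h."""
    logits = [dot(row, h) for row in W]
    g = softmax(logits)
    g[target] -= 1.0  # p - one_hot(target), a V-dimensional vector
    return g, transpose_matvec(W, g)  # the second item is D-dimensional


def orthonormal_basis(vectors, tol=1e-12):
    """Modified Gram-Schmidt; returns an orthonormal basis of the span."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        n = math.sqrt(dot(w, w))
        if n > tol:
            basis.append([wi / n for wi in w])
    return basis


def split_logit_gradient(W, g):
    """Split g into the part inside col(W) and the part orthogonal to it."""
    D = len(W[0])
    columns = [[row[d] for row in W] for d in range(D)]
    kept = [0.0] * len(g)
    for q in orthonormal_basis(columns):
        c = dot(g, q)
        kept = [k + c * qi for k, qi in zip(kept, q)]
    lost = [gi - ki for gi, ki in zip(g, kept)]
    return kept, lost


if __name__ == "__main__":
    random.seed(0)
    V, D = 200, 8  # toy sizes chosen for illustration, with D much smaller than V
    W = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(V)]
    h = [random.gauss(0.0, 1.0) for _ in range(D)]
    g, grad_h = lm_head_backward(W, h, target=3)
    kept, lost = split_logit_gradient(W, g)
    ratio = math.sqrt(dot(lost, lost) / dot(g, g))
    print("norm of the orthogonal part relative to the logit gradient:", ratio)
```

The printed ratio depends on the random draw and on the toy sizes, so it should not be compared with the 95-99% figure reported for trained models. The point of the sketch is only structural: `transpose_matvec(W, lost)` is numerically zero, so the backpropagated signal is determined entirely by `kept`, which lives in a subspace of dimension at most D.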
