Hidden Dynamics of Massive Activations in Transformer Training

Abstract

Massive activations are scalar values in transformer hidden states that reach magnitudes orders of magnitude larger than typical activations and have been shown to be critical for model functionality. While prior work has characterized these phenomena in fully trained models, the temporal dynamics of their emergence during training remain poorly understood. We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed. Through systematic analysis of various model sizes across multiple training checkpoints, we demonstrate that massive activation emergence follows predictable mathematical patterns that can be accurately modeled using an exponentially modulated logarithmic function with five key parameters. We develop a machine learning framework to predict these mathematical parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These findings enable architects to predict and potentially control key aspects of massive activation emergence through design choices, with significant implications for model stability, training cycle length, interpretability, and optimization. Our findings demonstrate that the emergence of massive activations is governed by model design and can be anticipated, and potentially controlled, before training begins.
