Hidden Dynamics of Massive Activations in Transformer Training
August 5, 2025
Authors: Jorge Gallego-Feliciano, S. Aaron McClendon, Juan Morinelli, Stavros Zervoudakis, Antonios Saravanos
cs.AI
Abstract
Massive activations are scalar values in transformer hidden states that
are orders of magnitude larger than typical activations and have
been shown to be critical for model functionality. While prior work has
characterized these phenomena in fully trained models, the temporal dynamics of
their emergence during training remain poorly understood. We present the first
comprehensive analysis of massive activation development throughout transformer
training, using the Pythia model family as our testbed. Through systematic
analysis of various model sizes across multiple training checkpoints, we
demonstrate that massive activation emergence follows predictable mathematical
patterns that can be accurately modeled using an exponentially modulated
logarithmic function with five key parameters. We develop a machine learning
framework to predict these mathematical parameters from architectural
specifications alone, achieving high accuracy for steady-state behavior and
moderate accuracy for emergence timing and magnitude. These findings enable
architects to predict and potentially control key aspects of massive activation
emergence through design choices, with significant implications for model
stability, training cycle length, interpretability, and optimization. Our
findings demonstrate that the emergence of massive activations is governed by
model design and can be anticipated, and potentially controlled, before
training begins.
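
The abstract does not spell out the fitted function, so the following is only a rough sketch under an assumed form that matches the description "exponentially modulated logarithmic function with five parameters": f(t) = A * exp(-lam * x) * log(x) + K, with x = gamma * t + t0. The function name, the parameter names (A, lam, gamma, t0, K), the parameter values, and the step range are all illustrative assumptions, not values or definitions taken from the paper.

```python
import numpy as np

def emergence_curve(t, A, lam, gamma, t0, K):
    """Assumed form: f(t) = A * exp(-lam * x) * log(x) + K, where x = gamma * t + t0.

    A     : amplitude of the transient rise
    lam   : decay rate of the exponential modulation
    gamma : time scaling applied to the training step
    t0    : time offset (keeps log(x) defined at t = 0)
    K     : steady-state level reached once the transient has decayed
    """
    x = gamma * np.asarray(t, dtype=float) + t0
    return A * np.exp(-lam * x) * np.log(x) + K

# Hypothetical parameter values, chosen only to illustrate the curve's shape.
params = dict(A=50.0, lam=0.5, gamma=1e-4, t0=1.0, K=20.0)

# Hypothetical range of training steps.
steps = np.linspace(0, 143_000, 1001)
values = emergence_curve(steps, **params)

# Step at which the modeled magnitude peaks, found on the grid.
peak_step = steps[np.argmax(values)]

print("start value:", values[0])
print("peak step:", peak_step)
print("final value:", values[-1])
```

With a positive decay rate, the logarithmic term drives an early rise, the exponential factor then damps it, and the curve settles toward K. Under this assumed form, that split would correspond to the abstract's distinction between steady-state behavior (K) and emergence timing and magnitude (the remaining parameters).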