Hidden Dynamics of Massive Activations in Transformer Training

August 5, 2025
Authors: Jorge Gallego-Feliciano, S. Aaron McClendon, Juan Morinelli, Stavros Zervoudakis, Antonios Saravanos
cs.AI

Abstract

Massive activations are scalar values in transformer hidden states that reach magnitudes several orders larger than typical activations and have been shown to be critical for model functionality. While prior work has characterized these phenomena in fully trained models, the temporal dynamics of their emergence during training remain poorly understood. We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed. Through systematic analysis of various model sizes across multiple training checkpoints, we demonstrate that massive activation emergence follows predictable mathematical patterns that can be accurately modeled using an exponentially modulated logarithmic function with five key parameters. We develop a machine learning framework to predict these mathematical parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These findings enable architects to predict and potentially control key aspects of massive activation emergence through design choices, with significant implications for model stability, training cycle length, interpretability, and optimization. Our findings demonstrate that the emergence of massive activations is governed by model design and can be anticipated, and potentially controlled, before training begins.
August 11, 2025