Hidden Dynamics of Massive Activations in Transformer Training

Abstract

Massive activations are scalar values in transformer hidden states that are orders of magnitude larger than typical activations and have been shown to be critical for model functionality. While prior work has characterized this phenomenon in fully trained models, the temporal dynamics of its emergence during training remain poorly understood. We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed. Through systematic analysis of multiple model sizes across training checkpoints, we demonstrate that massive activation emergence follows predictable mathematical patterns that can be accurately modeled by an exponentially modulated logarithmic function with five key parameters. We also develop a machine learning framework that predicts these parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These findings let architects predict, and potentially control, key aspects of massive activation emergence through design choices, with significant implications for model stability, training cycle length, interpretability, and optimization. Overall, the emergence of massive activations is governed by model design and can be anticipated, and potentially controlled, before training begins.
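
The abstract names an exponentially modulated logarithmic function with five parameters but does not spell out its form. As a rough illustration only, the sketch below assumes one plausible parameterization, f(t) = A * exp(-lam * x) * log(x) + K with x = gamma * t + t0; the function name, the parameter names (A, lam, gamma, t0, K), and the example values are all hypothetical and are not taken from the text above.

```python
import math


def massive_activation_curve(t, A, lam, gamma, t0, K):
    """Assumed five-parameter curve: A * exp(-lam * x) * log(x) + K.

    t      : training step (or checkpoint index)
    A      : amplitude of the transient rise (assumed meaning)
    lam    : decay rate of the exponential modulation
    gamma  : rescaling of the time axis
    t0     : time offset; must keep x = gamma * t + t0 positive
    K      : value the curve settles to as t grows (when lam > 0)
    """
    x = gamma * t + t0
    if x <= 0:
        raise ValueError("gamma * t + t0 must be positive for log(x)")
    return A * math.exp(-lam * x) * math.log(x) + K


def peak_step(steps, **params):
    """Return the step in `steps` where the assumed curve is largest."""
    return max(steps, key=lambda t: massive_activation_curve(t, **params))


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the curve's shape.
    params = dict(A=2.0, lam=0.1, gamma=1.0, t0=1.0, K=5.0)
    for step in (0, 1, 5, 20, 100, 1000):
        print(step, massive_activation_curve(step, **params))
    print("peak at step", peak_step(range(0, 200), **params))
```

Under this assumed form with lam > 0, the exponential factor eventually dominates the logarithm, so the curve rises early, peaks, and then decays toward K. That shape is one way to read the abstract's distinction between steady-state behavior and emergence timing and magnitude.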
