Negligible in Size, Substantial in Effect: On Scale Vectors in Large Language Models

Abstract

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.
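
The expressivity claim for Pre-Norm architectures can be illustrated with a small numerical check. The sketch below assumes RMS-style normalization (the abstract does not name a specific normalization operation) and uses hypothetical helper names (`rms_normalize`, `norm_scale_linear`, `absorb_scale`). It shows only the standard absorption identity, that a scale vector followed by a linear map W computes the same function as the linear map W diag(gamma) applied to the unscaled normalized input. It is not the paper's proof and says nothing about the optimization effect.

```python
import math
import random

def rms_normalize(x, eps=1e-6):
    """Parameter-free RMS normalization of a vector (assumed normalization op)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def linear(W, x):
    """Apply y = W x, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def norm_scale_linear(W, gamma, x):
    """Normalize, multiply elementwise by the scale vector, then apply W."""
    z = rms_normalize(x)
    return linear(W, [g * v for g, v in zip(gamma, z)])

def absorb_scale(W, gamma):
    """Fold gamma into the columns of W, i.e. W' = W diag(gamma)."""
    return [[w * g for w, g in zip(row, gamma)] for row in W]

# Hypothetical toy sizes and random values, chosen only for illustration.
random.seed(0)
d_in, d_out = 8, 4
x = [random.gauss(0, 1) for _ in range(d_in)]
gamma = [random.uniform(0.5, 1.5) for _ in range(d_in)]
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]

with_scale = norm_scale_linear(W, gamma, x)
without_scale = linear(absorb_scale(W, gamma), rms_normalize(x))

# The two outputs agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(with_scale, without_scale))
print(max_gap < 1e-9)
```

Because the two parameterizations describe the same set of functions, any benefit from keeping the scale vector as a separate parameter must come from how training proceeds rather than from what the model can represent, which is consistent with the abstract's framing of the scale vector as an optimization aid.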