Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models

Abstract

Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. Channel-wise operations and adaptive gradient scaling are recognized causes, but practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, which eliminates privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, which prevents channel-wise amplification; and (3) a learnable embedding projection, which redistributes activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, making it the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves an average score of 35.7 across 10 benchmarks (compared to 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04), compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.
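To make the excess-kurtosis figures concrete, the following is a minimal NumPy sketch, not the authors' implementation. It computes excess kurtosis as the fourth standardized moment minus 3 and compares a per-channel gain against a single scalar gain in RMSNorm. The function names, the tensor shape, the amplified channel index, and the gain value of 50 are all illustrative assumptions, and the reading of Single-Scale RMSNorm as "one scalar gain shared by all channels" is inferred from its name and the abstract's description.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    x = np.asarray(x, dtype=np.float64).ravel()
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    return float(np.mean(centered ** 4) / variance ** 2 - 3.0)

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm over the last axis.

    `gain` is a per-channel vector in the usual formulation, or a single
    scalar in the single-scale variant assumed in this sketch.
    """
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gain * x / rms

rng = np.random.default_rng(0)
hidden = rng.standard_normal((512, 64))  # hypothetical (tokens, channels) activations

# Hypothetical per-channel gain in which one channel has been strongly amplified.
per_channel_gain = np.ones(64)
per_channel_gain[7] = 50.0

with_outlier = rms_norm(hidden, per_channel_gain)
single_scale = rms_norm(hidden, 1.0)  # one scalar gain shared by every channel

print("per-channel gain:", excess_kurtosis(with_outlier))  # heavy-tailed, large value
print("single scalar gain:", excess_kurtosis(single_scale))  # close to zero
```

A single amplified channel is enough to push the excess kurtosis far above zero, which is the kind of heavy-tailed distribution that makes low-bit quantization lossy. With a shared scalar gain, no individual channel can be rescaled relative to the others.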
