Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models

Abstract

Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. Although channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, which eliminates privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, which prevents channel-wise amplification; and (3) a learnable embedding projection, which redistributes activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves an average score of 35.7 across 10 benchmarks (compared to 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04), compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.
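To make the link between excess kurtosis and 4-bit quantization error concrete, the following is a minimal illustrative sketch, not the authors' evaluation code. It assumes symmetric per-tensor round-to-nearest quantization with a single absmax scale, and it uses synthetic data: a Gaussian tensor and a copy with a few hypothetical outliers injected. The function names, the outlier value (100.0), and the tensor size are all assumptions chosen for illustration.

```python
import random


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3; close to 0 for Gaussian data."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2) - 3.0


def fake_quantize_int4(xs):
    """Round to a signed 4-bit grid with one absmax scale, then dequantize."""
    scale = max(abs(x) for x in xs) / 7.0  # largest magnitude maps to level 7
    return [max(-8, min(7, round(x / scale))) * scale for x in xs]


def mse(xs, ys):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed for repeatability
clean = [rng.gauss(0.0, 1.0) for _ in range(4096)]

# Hypothetical outliers: overwrite 8 of the 4096 entries with a large value.
spiky = list(clean)
for i in range(0, len(spiky), 512):
    spiky[i] = 100.0

kurt_clean = excess_kurtosis(clean)
kurt_spiky = excess_kurtosis(spiky)
mse_clean = mse(clean, fake_quantize_int4(clean))
mse_spiky = mse(spiky, fake_quantize_int4(spiky))

print(f"excess kurtosis: clean={kurt_clean:.2f}, spiky={kurt_spiky:.2f}")
print(f"4-bit MSE:       clean={mse_clean:.4f}, spiky={mse_spiky:.4f}")
```

In the tensor with injected outliers, the single scale is set by the largest value, so the ordinary entries collapse onto the zero level and the reconstruction error grows sharply. This is the failure mode that a near-zero excess kurtosis is meant to rule out.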
