
Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models

June 24, 2025
作者: Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang
cs.AI

Abstract

Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, preventing channel-wise amplification; and (3) a learnable embedding projection, redistributing activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves a 35.7 average score across 10 benchmarks (compared to 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04) compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.
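The abstract's key diagnostic (excess kurtosis, near zero for Gaussian-like activations but enormous when a few extreme outliers dominate) and the intuition behind Single-Scale RMSNorm can be illustrated with a minimal sketch. All function and variable names below are our own assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def excess_kurtosis(x):
    """Excess kurtosis: roughly 0 for Gaussian data, very large when
    a handful of extreme values dominate the distribution."""
    x = np.ravel(x)
    mu, sigma = x.mean(), x.std()
    return np.mean(((x - mu) / sigma) ** 4) - 3.0

def single_scale_rmsnorm(x, gain=1.0, eps=1e-6):
    """RMSNorm with one shared scalar gain instead of a learned
    per-channel gain vector, so no single channel can be selectively
    amplified into an activation outlier (a sketch of the
    Single-Scale RMSNorm idea described in the abstract)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

rng = np.random.default_rng(0)
gaussian = rng.normal(size=100_000)          # well-behaved activations
outliers = gaussian.copy()
outliers[:10] *= 200.0                       # a few extreme activations

print(round(excess_kurtosis(gaussian), 2))   # near 0
print(excess_kurtosis(outliers) > 100)       # heavy-tailed distribution
```

Under 4-bit quantization the heavy-tailed case is what hurts: a uniform quantizer must stretch its range to cover the few outliers, wasting almost all of its 16 levels on values that rarely occur.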