

Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models

June 24, 2025
Authors: Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang
cs.AI

Abstract

Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, preventing channel-wise amplification; and (3) a learnable embedding projection, redistributing activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves a 35.7 average score across 10 benchmarks (compared to 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04) compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.
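The abstract's second ingredient, Single-Scale RMSNorm, is described only as "preventing channel-wise amplification"; the exact formulation is in the linked repository. As a rough illustration of the stated idea, the sketch below contrasts standard RMSNorm (per-channel gain vector) with a single-scale variant in which one shared scalar replaces the gain vector. The function names and the numpy formulation are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    # Standard RMSNorm: a learned per-channel gain vector rescales
    # each channel independently, which can selectively amplify
    # individual channels (a route to activation outliers).
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

def single_scale_rmsnorm(x, scale=1.0, eps=1e-6):
    # Single-scale variant (sketch of the abstract's description):
    # one shared scalar replaces the gain vector, so no channel can
    # be amplified relative to the others.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * scale
```

With `scale=1.0` and an all-ones gain vector the two coincide; the difference is purely in what the optimizer can learn per channel.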
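The near-zero excess kurtosis (0.04 vs. 1818.56) that the abstract reports is the standard fourth-moment statistic: zero for a Gaussian, very large when a few extreme activations dominate. A minimal sketch of how such a measurement behaves (synthetic data, not the paper's activations):

```python
import numpy as np

def excess_kurtosis(x):
    # Fourth standardized moment minus 3: ~0 for Gaussian-like data
    # (quantization-friendly), very large when rare extreme outliers
    # dominate the distribution.
    x = np.asarray(x, dtype=np.float64).ravel()
    mu, var = x.mean(), x.var()
    return float(np.mean((x - mu) ** 4) / var ** 2 - 3.0)

rng = np.random.default_rng(0)
gaussian = rng.standard_normal(100_000)       # outlier-free activations
with_outliers = gaussian.copy()
with_outliers[:10] *= 100.0                   # a handful of extreme values

print(excess_kurtosis(gaussian))       # near 0
print(excess_kurtosis(with_outliers))  # orders of magnitude larger
```

This is why the statistic is a useful proxy for 4-bit quantizability: uniform quantization grids are sized by the extremes, so heavy tails waste most of the 16 available levels on values that almost never occur.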