Training Dynamics Impact Post-Training Quantization Robustness

Abstract

While post-training quantization is widely adopted for efficient deployment of large language models, the mechanisms underlying quantization robustness remain unclear. We conduct a comprehensive analysis of quantization degradation across the training trajectories of open-source language models, covering models of up to 32B parameters and up to 15T training tokens, to assess the relationship between training dynamics and quantization performance. Our key finding is that quantization errors in large-scale training runs are driven by a complex interplay between learning rate and other training hyperparameters. Specifically, once learning rates decay, validation loss and quantization error diverge, largely independent of training data scale. To investigate interventions on the training dynamics and identify specific configurations that favorably modulate quantization robustness, we train our own models in controlled experiments of up to 100B tokens. Our results challenge the assumption that increasing dataset scale inherently compromises quantization effectiveness, demonstrating instead that strategic training hyperparameter interventions can improve quantization quality at scale.
