Training Dynamics Impact Post-Training Quantization Robustness

Abstract

While post-training quantization is widely adopted for efficient deployment of large language models, the mechanisms underlying quantization robustness remain unclear. We conduct a comprehensive analysis of quantization degradation across open-source language model training trajectories up to 32B parameters and 15T training tokens to accurately assess the relationship between training dynamics and quantization performance. Our key finding is that quantization errors in large-scale training runs are driven by a complex interplay between learning rate and other training hyperparameters. Specifically, once learning rates decay, validation loss and quantization error diverge, largely independent of training data scale. To investigate interventions on the training dynamics and identify specific configurations that can modulate quantization robustness favorably, we train our own models in controlled experiments up to 100B tokens. Our results challenge the assumption that increasing dataset scale inherently compromises quantization effectiveness, demonstrating instead that strategic training hyperparameter interventions can improve quantization quality at scale.
