
Beyond Outliers: A Study of Optimizers Under Quantization

September 27, 2025
Authors: Georgios Vlassis, Saleh Ashkboos, Alexandra Volkova, Torsten Hoefler, Dan Alistarh
cs.AI

Abstract

As new optimizers gain traction and model quantization becomes standard for efficient deployment, a key question arises: how does the choice of optimizer affect model performance in the presence of quantization? Despite progress in both areas, systematic evidence on optimizer-quantization interactions remains limited. To fill this gap, we study the impact of optimizer choice on model robustness under quantization, considering both post-training quantization (PTQ) and quantization-aware training (QAT). We first train full-precision models ranging from 50M to 1.5B parameters with six optimizers to explore the hyperparameter landscape and establish well-tuned baselines. We then apply PTQ to evaluate how model performance degrades when trained with different optimizers. We find that outlier-related metrics, such as the max-to-mean ratio (MMR) and kurtosis, fail to predict PTQ performance across different optimizers. We show analytically that this is because the MMR captures only isolated layer errors, while ignoring how quantization errors accumulate and propagate through the network. To study QAT degradation, we train quantized models from scratch and compare them to our original-precision baselines. We find that optimizers that perform well in the original pretraining setup may not remain optimal under QAT, and that models trained with Shampoo show the lowest accuracy degradation. Finally, we derive scaling laws for quantization-aware training under different optimizers, showing that Shampoo achieves the highest parameter efficiency of all tested optimizers.
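As a concrete illustration of the per-layer outlier metrics discussed in the abstract, the sketch below computes a max-to-mean ratio and kurtosis for each linear layer of a model, alongside the reconstruction error of a simple symmetric round-to-nearest INT8 quantizer. This is a minimal sketch, not the paper's code: the exact metric definitions (here MMR is taken as max(|W|)/mean(|W|), kurtosis as the standard fourth standardized moment) and the per-tensor quantization scheme are assumptions, and the helper names are illustrative.

```python
# Minimal sketch (not the paper's code): per-layer outlier metrics and a toy PTQ error check.
# Assumes MMR = max(|W|) / mean(|W|), standard kurtosis, and symmetric per-tensor
# round-to-nearest INT8 quantization; the paper's exact definitions may differ.
import torch
import torch.nn as nn


def max_to_mean_ratio(w: torch.Tensor) -> float:
    a = w.detach().abs().flatten()
    return (a.max() / a.mean()).item()


def kurtosis(w: torch.Tensor) -> float:
    x = w.detach().flatten().float()
    x = x - x.mean()
    return ((x ** 4).mean() / (x ** 2).mean() ** 2).item()


def rtn_int8(w: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor round-to-nearest quantization to 8 bits.
    scale = w.detach().abs().max() / 127.0
    return torch.round(w / scale).clamp(-127, 127) * scale


@torch.no_grad()
def layer_report(model: nn.Module) -> None:
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            w = module.weight
            rmse = (rtn_int8(w) - w).pow(2).mean().sqrt().item()
            print(f"{name}: MMR={max_to_mean_ratio(w):.1f} "
                  f"kurtosis={kurtosis(w):.1f} rtn_rmse={rmse:.2e}")


if __name__ == "__main__":
    # Toy model as a stand-in for a pretrained checkpoint.
    layer_report(nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)))
```

Because these quantities are computed layer by layer, they capture only isolated per-layer error and say nothing about how quantization error accumulates and propagates through the network, which is precisely the limitation the abstract identifies for MMR-style predictors.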