Training Dynamics Impact Post-Training Quantization Robustness

Abstract

While post-training quantization is widely adopted for efficient deployment of large language models, the mechanisms underlying quantization robustness remain unclear. We conduct a comprehensive analysis of quantization degradation across open-source language model training trajectories up to 32B parameters and 15T training tokens to accurately assess the relationship between training dynamics and quantization performance. Our key finding is that quantization errors in large-scale training runs are driven by a complex interplay between learning rate and other training hyperparameters. Specifically, once learning rates decay, validation loss and quantization error diverge, largely independent of training data scale. To investigate interventions on the training dynamics and identify specific configurations that can modulate quantization robustness favorably, we train our own models in controlled experiments up to 100B tokens. Our results challenge the assumption that increasing dataset scale inherently compromises quantization effectiveness, demonstrating instead that strategic training hyperparameter interventions can improve quantization quality at scale.
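The abstract does not say how quantization error is measured or which quantization scheme is used. As a rough illustration only, the sketch below assumes symmetric round-to-nearest quantization with one scale per tensor and reports a weight-space relative error. The function names `quantize_rtn` and `relative_quantization_error` are hypothetical, and a loss-based measure (quantized versus full-precision validation loss) may be closer to what the authors track.

```python
def quantize_rtn(weights, bits=4):
    """Quantize and dequantize a list of floats with symmetric round-to-nearest.

    Assumption: a single per-tensor scale chosen from the largest magnitude.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed integers
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)            # nothing to quantize
    scale = max_abs / qmax
    # Round to the nearest integer level, clamp, then map back to floats.
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def relative_quantization_error(weights, bits=4):
    """Relative L2 error between the original and dequantized weights."""
    dequantized = quantize_rtn(weights, bits)
    num = sum((w - q) ** 2 for w, q in zip(weights, dequantized))
    den = sum(w ** 2 for w in weights)
    return (num / den) ** 0.5 if den else 0.0


if __name__ == "__main__":
    # Toy weights, purely illustrative (not data from the paper).
    toy_weights = [0.1, -0.5, 0.25, 0.9, -0.33]
    for bits in (8, 4, 3):
        print(bits, relative_quantization_error(toy_weights, bits))
```

Evaluating such a measure at successive checkpoints of a training run would be one simple way to trace how quantization degradation evolves as the learning rate decays, under the assumptions stated above.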
