Training Dynamics Impact Post-Training Quantization Robustness

Abstract

While post-training quantization is widely adopted for efficient deployment of large language models, the mechanisms underlying quantization robustness remain unclear. We analyze quantization degradation across the training trajectories of open-source language models, up to 32B parameters and 15T training tokens, to assess the relationship between training dynamics and quantization performance. Our key finding is that quantization error in large-scale training runs is driven by a complex interplay between the learning rate and other training hyperparameters. Specifically, once the learning rate decays, validation loss and quantization error diverge, largely independent of training data scale. To investigate interventions on the training dynamics and identify configurations that favorably modulate quantization robustness, we train our own models in controlled experiments up to 100B tokens. Our results challenge the assumption that increasing dataset scale inherently compromises quantization effectiveness, demonstrating instead that strategic interventions on training hyperparameters can improve quantization quality at scale.
