Scaling Law for Quantization-Aware Training

Abstract

Large language models (LLMs) demand substantial computational and memory resources, which complicates deployment. Quantization-aware training (QAT) addresses this by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision for both weights and activations (W4A4), is not well understood. Existing QAT scaling laws often ignore key factors such as the number of training tokens and the quantization granularity, which limits their applicability. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity. To identify the sources of W4A4 quantization error, we decompose it into weight and activation components. Both components follow the overall trend of W4A4 quantization error, but with different sensitivities. Specifically, weight quantization error increases more rapidly with more training tokens. Further analysis shows that the activation quantization error in the FC2 layer, caused by outliers, is the primary bottleneck of W4A4 QAT quantization error. By applying mixed-precision quantization to address this bottleneck, we demonstrate that weight and activation quantization errors can converge to similar levels. Additionally, with more training data, weight quantization error eventually exceeds activation quantization error, suggesting that reducing weight quantization error is also important in such scenarios. These findings offer key insights for improving QAT research and development.
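
To make the stated trends concrete, the following is a minimal sketch of one power-law form that is consistent with them: error falls with model size and rises with training tokens and group size. The functional form, the function name `qat_error`, and all constants (`k`, `gamma_n`, `gamma_d`, `gamma_g`) are assumptions chosen for illustration. They are not fitted values and are not taken from the abstract.

```python
import math


def qat_error(n_params, n_tokens, group_size,
              k=0.1, gamma_n=0.3, gamma_d=0.1, gamma_g=0.3):
    """Hypothetical quantization-error model (illustrative only).

    error = k * D**gamma_d * log2(G)**gamma_g / N**gamma_n

    N = n_params (model size), D = n_tokens (training tokens),
    G = group_size (quantization group size). All exponents are
    placeholder values, assumed positive so that the error shrinks
    with N and grows with D and G.
    """
    if n_params <= 0 or n_tokens <= 0:
        raise ValueError("n_params and n_tokens must be positive")
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so that log2(G) > 0")
    numerator = k * (n_tokens ** gamma_d) * (math.log2(group_size) ** gamma_g)
    return numerator / (n_params ** gamma_n)


if __name__ == "__main__":
    # Compare a baseline against one change at a time.
    base = qat_error(n_params=1e8, n_tokens=1e11, group_size=128)
    bigger_model = qat_error(n_params=1e9, n_tokens=1e11, group_size=128)
    more_tokens = qat_error(n_params=1e8, n_tokens=1e12, group_size=128)
    finer_groups = qat_error(n_params=1e8, n_tokens=1e11, group_size=32)
    print(f"baseline      {base:.6f}")
    print(f"10x params    {bigger_model:.6f}")
    print(f"10x tokens    {more_tokens:.6f}")
    print(f"group size 32 {finer_groups:.6f}")
```

With any positive exponents, this sketch reproduces the qualitative ordering described above. The actual exponents would have to be fitted to measured quantization errors.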

