Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs

Abstract

Large language models (LLMs) have demonstrated exceptional capabilities in language-related tasks. However, deploying them is challenging because of their considerable memory and storage requirements. In response, weight-only quantization, particularly at 3 and 4 bits, has emerged as one of the most viable solutions. As the number of bits decreases, the quantization grid becomes coarser, which makes the choice between rounding up and rounding down more important. Previous studies have shown that fine-tuning the rounding direction by adding perturbations can improve accuracy in some scenarios. Our study is motivated by the observation that these perturbations have a precise and limited range: only the threshold at which the rounding value changes matters. Consequently, we propose a concise and highly effective approach for optimizing the weight rounding task. Our method, named SignRound, uses lightweight block-wise tuning with signed gradient descent and achieves outstanding results within 400 steps. SignRound outperforms the established rounding-to-nearest (RTN) baseline and competes impressively against recent methods, without introducing additional inference overhead. The source code will be publicly available at https://github.com/intel/neural-compressor soon.
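To make the idea concrete, the following is a minimal sketch of rounding tuned by signed gradient descent. It is not the SignRound implementation. It assumes a single linear neuron instead of a transformer block, symmetric 4-bit quantization, a rounding perturbation bounded to [-0.5, 0.5], a straight-through estimator for the gradient through rounding, and a linearly decaying step size. The names `quantize`, `output_mse`, and `tune_rounding`, as well as the learning rate, are hypothetical choices made for illustration.

```python
import math
import random


def quantize(w, scale, v, qmin=-8, qmax=7):
    """Quantize then dequantize each weight; v[i] shifts the rounding point."""
    out = []
    for wi, vi in zip(w, v):
        q = math.floor(wi / scale + vi + 0.5)  # round to nearest after the shift
        q = max(qmin, min(qmax, q))            # clip to the 4-bit integer range
        out.append(q * scale)
    return out


def output_mse(w, w_hat, xs):
    """Mean squared error between full-precision and quantized outputs."""
    total = 0.0
    for x in xs:
        err = sum((wh - wi) * xi for wh, wi, xi in zip(w_hat, w, x))
        total += err * err
    return total / len(xs)


def sign(g):
    return (g > 0) - (g < 0)


def tune_rounding(w, xs, scale, steps=400, lr=0.005):
    """Tune the perturbation v with signed gradient descent (illustrative)."""
    v = [0.0] * len(w)  # v = 0 reproduces plain rounding-to-nearest (RTN)
    best_v = list(v)
    best_loss = output_mse(w, quantize(w, scale, v), xs)
    for step in range(steps):
        w_hat = quantize(w, scale, v)
        # Straight-through estimator (assumption): d(w_hat_i)/d(v_i) ~= scale.
        grad = [0.0] * len(w)
        for x in xs:
            err = sum((wh - wi) * xi for wh, wi, xi in zip(w_hat, w, x))
            for i, xi in enumerate(x):
                grad[i] += 2.0 * err * xi * scale / len(xs)
        step_lr = lr * (1.0 - step / steps)  # linear decay (assumption)
        # Only the sign of the gradient is used; v stays inside [-0.5, 0.5].
        v = [max(-0.5, min(0.5, vi - step_lr * sign(gi)))
             for vi, gi in zip(v, grad)]
        loss = output_mse(w, quantize(w, scale, v), xs)
        if loss < best_loss:
            best_v, best_loss = list(v), loss
    return best_v, best_loss


if __name__ == "__main__":
    rng = random.Random(0)
    w = [rng.uniform(-1.0, 1.0) for _ in range(16)]
    xs = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(32)]
    scale = max(abs(wi) for wi in w) / 7
    rtn_loss = output_mse(w, quantize(w, scale, [0.0] * len(w)), xs)
    _, tuned_loss = tune_rounding(w, xs, scale)
    print("RTN loss:", rtn_loss, "tuned loss:", tuned_loss)
```

Because the search starts at `v = 0` and keeps the best perturbation seen, the returned loss in this sketch can never be worse than the RTN loss on the calibration inputs. The tuned perturbation only changes which integer each weight rounds to, so the quantized weights have the same format as under RTN and need no extra computation at inference time.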
