GPTVQ: The Blessing of Dimensionality for LLM Quantization

Abstract

In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose GPTVQ, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated and further compressed using integer quantization and SVD-based compression. GPTVQ establishes a new state of the art in the size versus accuracy trade-off on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llama-v2-70B model, depending on the quantization setting. Lastly, with on-device timings for VQ decompression on a mobile CPU, we show that VQ leads to improved latency compared to using a 4-bit integer format.
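To make the notion of quantization dimensionality concrete, the following is a minimal sketch of plain vector quantization of a weight matrix, not the GPTVQ algorithm itself. It assumes ordinary k-means (Lloyd's algorithm) in place of the data-aware EM initialization, uses no Hessian information, performs no updates to unquantized weights, and ignores codebook storage when counting bits per weight. The function names `kmeans_codebook` and `vq_quantize` and the random test matrix are hypothetical and chosen only for illustration.

```python
import numpy as np


def kmeans_codebook(points, k, iters=25, seed=0):
    """Plain Lloyd's k-means (an assumption; the paper uses a data-aware EM).

    points: array of shape (n, dim). Returns (codebook, assignments).
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points.
    codebook = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (n, k).
        dists = ((points[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # leave empty clusters where they are
                codebook[j] = members.mean(axis=0)
    # Final assignment against the last codebook update.
    dists = ((points[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook, dists.argmin(axis=1)


def vq_quantize(weights, dim, index_bits):
    """Quantize a 2-D weight matrix using a `dim`-dimensional codebook.

    Each group of `dim` consecutive weights in a row is replaced by its
    nearest centroid. Bits per weight counts index bits only.
    """
    rows, cols = weights.shape
    assert cols % dim == 0, "row length must be divisible by the VQ dimension"
    points = weights.reshape(-1, dim)
    codebook, assign = kmeans_codebook(points, k=2 ** index_bits)
    w_hat = codebook[assign].reshape(rows, cols)
    bits_per_weight = index_bits / dim
    return w_hat, bits_per_weight


# Hypothetical stand-in for a layer's weights.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)).astype(np.float32)

# Same index budget (2 bits per weight) at two quantization dimensionalities.
for dim, index_bits in [(1, 2), (2, 4)]:
    W_hat, bpw = vq_quantize(W, dim, index_bits)
    mse = float(np.mean((W - W_hat) ** 2))
    print(f"dim={dim} bits/weight={bpw:.1f} mse={mse:.4f}")
```

Both settings spend 2 index bits per weight; the 2-dimensional codebook has 16 entries that can be placed anywhere in the plane rather than on a 4-by-4 grid, which is the extra freedom that higher-dimensional quantization offers. How much this helps in practice depends on the weight distribution and on the codebook overhead, which this sketch does not model.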
