any4: Learned 4-bit Numeric Representation for LLMs

Abstract

We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) that provides arbitrary numeric representations without requiring preprocessing of weights or activations. Evaluated across a range of model sizes, generations, and families (Llama 2, Llama 3, Mistral, and Mixtral), any4 yields higher accuracy than other related 4-bit numeric representation types: int4, fp4, and nf4. Although any4 requires no preprocessing of weights or activations, it is also competitive with orthogonal techniques that do require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show competitiveness at lower bit widths. Additionally, we show that calibration can use a single curated, diverse sample rather than the hundreds of dataset samples used by most quantization approaches. We also open-source tinygemm, a latency-optimized GPU matrix multiplication library for LLMs that implements any4 using a GPU-efficient lookup table strategy, along with other common quantization methods. Our code is available at https://github.com/facebookresearch/any4.
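The abstract does not spell out how the 4-bit values are learned or how the lookup table is organized, so the following is only a minimal sketch of the general idea of a learned lookup-table representation, not the any4 algorithm or the tinygemm kernels. It assumes one 16-entry table per weight row, fitted with plain (unweighted) 1-D k-means that starts from a uniform grid; the function names `learn_row_lut`, `quantize`, and `dequantize` are hypothetical and do not come from the any4 repository. No calibration data is used here.

```python
import numpy as np


def learn_row_lut(row, n_levels=16, n_iters=20):
    """Fit a 16-entry lookup table to one weight row with 1-D k-means.

    Assumption for illustration: unweighted Lloyd iterations, initialized
    on a uniform grid spanning the row's min and max.
    """
    lut = np.linspace(row.min(), row.max(), n_levels)
    for _ in range(n_iters):
        # Assign every weight to its nearest table entry.
        codes = np.abs(row[:, None] - lut[None, :]).argmin(axis=1)
        # Move each entry to the mean of its members; keep empty entries.
        for k in range(n_levels):
            members = row[codes == k]
            if members.size:
                lut[k] = members.mean()
    # Final assignment against the updated table.
    codes = np.abs(row[:, None] - lut[None, :]).argmin(axis=1)
    return lut, codes.astype(np.uint8)


def quantize(weights, n_levels=16):
    """Return per-row tables (rows, n_levels) and 4-bit codes (rows, cols)."""
    luts, codes = zip(*(learn_row_lut(r, n_levels) for r in weights))
    return np.stack(luts), np.stack(codes)


def dequantize(luts, codes):
    """Reconstruct weights by looking up each code in its row's table."""
    return np.take_along_axis(luts, codes.astype(np.int64), axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 256)).astype(np.float32)
    luts, codes = quantize(w)
    w_hat = dequantize(luts, codes)
    print("reconstruction MSE:", float(np.mean((w - w_hat) ** 2)))
```

Because the table starts on a uniform grid and every k-means step can only lower the squared error, the learned table in this sketch never reconstructs a row worse than that uniform 16-level grid. A real GPU kernel would pack two 4-bit codes per byte and perform the lookup inside the matrix multiplication, which this sketch does not attempt.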
