A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

Abstract

Prior work has evaluated quantized LLMs using limited metrics such as perplexity, a few basic knowledge tasks, and outdated datasets. In addition, recent large-scale models such as Llama 3.1, with up to 405B parameters, have not been thoroughly examined. This paper evaluates the performance of instruction-tuned LLMs under various quantization methods (GPTQ, AWQ, SmoothQuant, and FP8) on models ranging from 7B to 405B parameters. Using 13 benchmarks, we assess performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. Our key findings are that (1) a larger LLM quantized to a size similar to that of a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following; (2) performance varies significantly with quantization method, model size, and bit-width, with weight-only methods often yielding better results in larger models; (3) task difficulty does not significantly affect the accuracy degradation caused by quantization; and (4) the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.
