A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B
September 17, 2024
Authors: Jemin Lee, Sihyeong Park, Jinse Kwon, Jihun Oh, Yongin Kwon
cs.AI
Abstract
Prior research has evaluated quantized LLMs using limited metrics such as perplexity or a few basic knowledge tasks, and on old datasets. Additionally, recent large-scale models such as Llama 3.1 with up to 405B parameters have not been thoroughly examined. This paper evaluates the performance of instruction-tuned LLMs across various quantization methods (GPTQ, AWQ, SmoothQuant, and FP8) on models ranging from 7B to 405B. Using 13 benchmarks, we assess performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. Our key findings reveal that (1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following; (2) performance varies significantly with different quantization methods, model sizes, and bit-widths, with weight-only methods often yielding better results in larger models; (3) task difficulty does not significantly affect the accuracy degradation caused by quantization; and (4) the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.
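The abstract does not specify the tooling used, but a single cell of such an evaluation grid could be reproduced roughly as sketched below: 4-bit weight-only (AWQ) quantization of an instruction-tuned checkpoint, followed by a small benchmark run with lm-evaluation-harness. The model name, quantization settings, and task selection here are illustrative assumptions, not the authors' pipeline.

```python
# Hypothetical sketch: AWQ 4-bit weight-only quantization of an
# instruction-tuned model, then a small benchmark run with lm-eval.
# Model name, quant_config values, and task choices are assumptions.
from awq import AutoAWQForCausalLM          # pip install autoawq
from transformers import AutoTokenizer      # pip install transformers
import lm_eval                              # pip install lm-eval

model_path = "meta-llama/Meta-Llama-3.1-8B-Instruct"   # assumed example model
quant_path = "llama-3.1-8b-instruct-awq-4bit"

# 1) Weight-only quantization (AWQ, 4-bit, group size 128).
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# 2) Evaluate the quantized checkpoint on two example benchmarks
#    (math and a hallucination-related task).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=f"pretrained={quant_path},dtype=float16",
    tasks=["gsm8k", "truthfulqa_mc2"],
)
print(results["results"])
```

Repeating this loop over quantization methods, bit-widths, and model sizes, and comparing against the FP16 baselines, is the kind of sweep the paper's 13-benchmark evaluation describes.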