A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B
September 17, 2024
Authors: Jemin Lee, Sihyeong Park, Jinse Kwon, Jihun Oh, Yongin Kwon
cs.AI
Abstract
Prior research has evaluated quantized LLMs using limited metrics such as perplexity or a few basic knowledge tasks, and on old datasets. Additionally, recent large-scale models such as Llama 3.1 with up to 405B parameters have not been thoroughly examined. This paper evaluates the performance of instruction-tuned LLMs across various quantization methods (GPTQ, AWQ, SmoothQuant, and FP8) on models ranging from 7B to 405B. Using 13 benchmarks, we assess performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. Our key findings reveal that (1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following; (2) performance varies significantly with different quantization methods, model sizes, and bit-widths, with weight-only methods often yielding better results in larger models; (3) task difficulty does not significantly affect the accuracy degradation caused by quantization; and (4) the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.
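The abstract does not specify the tooling used, but a single cell of such an evaluation grid could be reproduced roughly as sketched below: 4-bit weight-only (AWQ) quantization of an instruction-tuned checkpoint, followed by a small benchmark run with lm-evaluation-harness. The model name, quantization settings, and task selection here are illustrative assumptions, not the authors' pipeline.

```python
# Hypothetical sketch: AWQ 4-bit weight-only quantization of an
# instruction-tuned model, then a small benchmark run with lm-eval.
# Model name, quant_config values, and task choices are assumptions.
from awq import AutoAWQForCausalLM          # pip install autoawq
from transformers import AutoTokenizer      # pip install transformers
import lm_eval                              # pip install lm-eval

model_path = "meta-llama/Meta-Llama-3.1-8B-Instruct"   # assumed example model
quant_path = "llama-3.1-8b-instruct-awq-4bit"

# 1) Weight-only quantization (AWQ, 4-bit, group size 128).
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# 2) Evaluate the quantized checkpoint on two example benchmarks
#    (math and a hallucination-related task).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=f"pretrained={quant_path},dtype=float16",
    tasks=["gsm8k", "truthfulqa_mc2"],
)
print(results["results"])
```

Repeating this loop over quantization methods, bit-widths, and model sizes, and comparing against the FP16 baselines, is the kind of sweep the paper's 13-benchmark evaluation describes.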