
A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

September 17, 2024
Authors: Jemin Lee, Sihyeong Park, Jinse Kwon, Jihun Oh, Yongin Kwon
cs.AI

Abstract

Prior studies have evaluated quantized LLMs using limited metrics such as perplexity or a few basic knowledge tasks, and on old datasets. Additionally, recent large-scale models such as Llama 3.1, with up to 405B parameters, have not been thoroughly examined. This paper evaluates the performance of instruction-tuned LLMs across various quantization methods (GPTQ, AWQ, SmoothQuant, and FP8) on models ranging from 7B to 405B. Using 13 benchmarks, we assess performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. Our key findings reveal that (1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following; (2) performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models; (3) task difficulty does not significantly impact accuracy degradation due to quantization; and (4) the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.
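For context on how weight-only quantization of an instruction-tuned model might be carried out in practice, the sketch below uses the Hugging Face transformers GPTQConfig integration. This is not the paper's evaluation pipeline; the model identifier, bit-width, and calibration dataset are illustrative assumptions only.

```python
# Minimal sketch of 4-bit weight-only GPTQ quantization via the Hugging Face
# transformers integration (requires the optimum and auto-gptq packages).
# Model ID, bit-width, and calibration set are illustrative, not the paper's setup.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # hypothetical example model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Weight-only 4-bit quantization calibrated on the "c4" preset dataset.
quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Passing quantization_config triggers GPTQ quantization while loading.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# Save the quantized checkpoint for later benchmarking.
model.save_pretrained("llama-3.1-8b-instruct-gptq-4bit")
tokenizer.save_pretrained("llama-3.1-8b-instruct-gptq-4bit")
```

A quantized checkpoint produced this way could then be scored with an evaluation harness such as lm-evaluation-harness on benchmark suites like those listed above.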

