Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models
July 17, 2024
作者: Ayush Kaushal, Tejas Pandey, Tejas Vaidhya, Aaryan Bhagat, Irina Rish
cs.AI
Abstract
Post-training quantization is the leading method for addressing
memory-related bottlenecks in LLM inference, but unfortunately, it suffers from
significant performance degradation below 4-bit precision. An alternative
approach involves training compressed models directly at a low bitwidth (e.g.,
binary or ternary models). However, the performance, training dynamics, and
scaling trends of such models are not yet well understood. To address this
issue, we train and openly release the Spectra LLM suite consisting of 54
language models ranging from 99M to 3.9B parameters, trained on 300B tokens.
Spectra includes FloatLMs, post-training quantized QuantLMs (3, 4, 6, and 8
bits), and ternary LLMs (TriLMs) - our improved architecture for ternary
language modeling, which significantly outperforms previously proposed ternary
models of a given size (in bits), matching half-precision models at scale. For
example, TriLM 3.9B is (bit-wise) smaller than the half-precision FloatLM 830M,
but matches half-precision FloatLM 3.9B in commonsense reasoning and knowledge
benchmarks. However, TriLM 3.9B is also as toxic and stereotyping as FloatLM
3.9B, a model six times larger in size. Additionally, TriLM 3.9B lags behind
FloatLM in perplexity on validation splits and web-based corpora but performs
better on less noisy datasets like LAMBADA and Penn Treebank.
To enhance understanding of low-bitwidth models, we are releasing 500+
intermediate checkpoints of the Spectra suite at
https://github.com/NolanoOrg/SpectraSuite.
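To make the ternary idea concrete: a ternary LLM constrains each weight to {-1, 0, +1} times a learned or computed scale, so a parameter costs roughly log2(3) ≈ 1.58 bits instead of 16. The sketch below shows one common absmean-style ternarization scheme; the exact TriLM quantization used in the Spectra suite may differ, so treat this as an illustrative assumption rather than the authors' implementation.

```python
import numpy as np

def ternarize(w: np.ndarray, eps: float = 1e-8):
    """Round a weight tensor to {-1, 0, +1} with a per-tensor scale.

    Absmean-style scaling (an assumption here -- the Spectra paper's
    TriLM scheme is not reproduced from this page). Dequantize as
    codes * scale.
    """
    scale = np.abs(w).mean() + eps            # per-tensor scale factor
    codes = np.clip(np.round(w / scale), -1, 1)  # ternary codes in {-1, 0, +1}
    return codes, scale

# Tiny example: every output code lands in {-1, 0, +1}.
w = np.array([0.4, -0.05, 1.2, -0.7])
codes, scale = ternarize(w)
print(codes, scale)
```

At 1.58 bits per weight, a 3.9B-parameter ternary model occupies on the order of 0.77 GB of weight storage, which is why the abstract can note it is bit-wise smaller than even a half-precision 830M model (about 1.66 GB at 16 bits per weight).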