Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models
July 17, 2024
Authors: Ayush Kaushal, Tejas Pandey, Tejas Vaidhya, Aaryan Bhagat, Irina Rish
cs.AI
Abstract
Post-training quantization is the leading method for addressing
memory-related bottlenecks in LLM inference, but unfortunately, it suffers from
significant performance degradation below 4-bit precision. An alternative
approach involves training compressed models directly at a low bitwidth (e.g.,
binary or ternary models). However, the performance, training dynamics, and
scaling trends of such models are not yet well understood. To address this
issue, we train and openly release the Spectra LLM suite consisting of 54
language models ranging from 99M to 3.9B parameters, trained on 300B tokens.
Spectra includes FloatLMs, post-training quantized QuantLMs (3, 4, 6, and 8
bits), and ternary LLMs (TriLMs) - our improved architecture for ternary
language modeling, which significantly outperforms previously proposed ternary
models of a given size (in bits), matching half-precision models at scale. For
example, TriLM 3.9B is (bit-wise) smaller than the half-precision FloatLM 830M,
but matches half-precision FloatLM 3.9B in commonsense reasoning and knowledge
benchmarks. However, TriLM 3.9B is also as toxic and stereotyping as FloatLM
3.9B, a model six times larger in size. Additionally, TriLM 3.9B lags behind
FloatLM in perplexity on validation splits and web-based corpora but performs
better on less noisy datasets like Lambada and PennTreeBank.
To enhance understanding of low-bitwidth models, we are releasing 500+
intermediate checkpoints of the Spectra suite at
https://github.com/NolanoOrg/SpectraSuite.
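To make the "(bit-wise) smaller" comparison concrete, here is a minimal Python sketch. It ternarizes a weight matrix with an absmean-style scale (an illustrative scheme, not necessarily TriLM's exact recipe, which the abstract does not specify) and then compares parameter storage in bits, assuming log2(3) ≈ 1.58 bits per ternary weight and 16 bits per FP16 weight; the model sizes are the ones quoted in the abstract.

```python
import numpy as np

def ternarize(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Map weights to {-1, 0, +1} times one per-matrix scale.

    Absmean scaling is an assumed, illustrative choice; it is not
    claimed to be the TriLM training scheme from the paper.
    """
    scale = float(np.abs(w).mean())
    q = np.clip(np.round(w / scale), -1, 1)  # entries in {-1, 0, 1}
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, s = ternarize(w)
print(q)                           # ternary codes
print(np.abs(w - s * q).mean())    # mean quantization error

# Back-of-the-envelope weight storage, counting parameters only:
bits_trilm_3_9b = 3.9e9 * np.log2(3)   # ~6.2e9 bits  (~0.77 GB)
bits_floatlm_830m = 830e6 * 16         # ~13.3e9 bits (~1.66 GB)
print(bits_trilm_3_9b < bits_floatlm_830m)  # True: TriLM 3.9B is smaller
```

Counting only the weights, a 3.9B-parameter ternary model needs roughly half the bits of an 830M-parameter FP16 model, which is the sense in which the abstract calls TriLM 3.9B "(bit-wise) smaller".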