any4: Learned 4-bit Numeric Representation for LLMs
July 7, 2025
Authors: Mostafa Elhoushi, Jeff Johnson
cs.AI
Abstract
We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) that provides arbitrary numeric representations without requiring preprocessing of weights or activations. any4 yields higher accuracy than other related 4-bit numeric representation types (int4, fp4, and nf4), as evaluated on a range of model sizes, generations, and families (Llama 2, Llama 3, Mistral, and Mixtral). While any4 does not require preprocessing of weights or activations, it is also competitive with orthogonal techniques that do require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show competitiveness at lower bit widths. Additionally, we show that we can calibrate using a single curated, diverse sample rather than the hundreds of dataset samples used in most quantization approaches. We also open-source tinygemm, a latency-optimized GPU matrix multiplication library for LLMs that implements any4 using a GPU-efficient lookup table strategy, along with other common quantization methods. Our code is available at https://github.com/facebookresearch/any4.
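
To make the lookup-table idea concrete, here is a minimal sketch (our illustration, not the authors' implementation or the tinygemm API) of how 4-bit codes that index a small learned table of arbitrary real values can be dequantized, in contrast to fixed grids like int4, fp4, or nf4. The function name, tensor shapes, and per-row table/scale layout are all illustrative assumptions.

```python
# Hypothetical sketch of lookup-table-based 4-bit dequantization.
# Assumption: one learned 16-entry table and one scale per weight row;
# the actual any4/tinygemm layout may differ.
import torch

def dequantize_any4(codes: torch.Tensor, lut: torch.Tensor,
                    scales: torch.Tensor) -> torch.Tensor:
    """Reconstruct weights from 4-bit codes.

    codes:  (rows, cols) uint8 tensor with values in [0, 15]
    lut:    (rows, 16) learned lookup table; entries can be any real values
    scales: (rows, 1) per-row scale factors
    """
    # Each code selects its row's learned value from the table,
    # then the per-row scale restores the original dynamic range.
    values = torch.gather(lut, dim=1, index=codes.long())
    return values * scales

if __name__ == "__main__":
    rows, cols = 4, 8
    codes = torch.randint(0, 16, (rows, cols), dtype=torch.uint8)
    lut = torch.sort(torch.randn(rows, 16), dim=1).values  # any 16 values per row
    scales = torch.rand(rows, 1)
    print(dequantize_any4(codes, lut, scales).shape)  # torch.Size([4, 8])
```

Under this reading, "learned" means the 16 table entries themselves are fit to the model's weights (e.g., by clustering each row's scaled values, one plausible choice), which is what lets any4 represent value distributions that fixed formats cannot.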