any4: Learned 4-bit Numeric Representation for LLMs
July 7, 2025
Authors: Mostafa Elhoushi, Jeff Johnson
cs.AI
Abstract
We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) that provides arbitrary numeric representations without requiring preprocessing of weights or activations. any4 yields higher accuracy than other related 4-bit numeric representation types (int4, fp4, and nf4), as evaluated on a range of model sizes, generations, and families (Llama 2, Llama 3, Mistral, and Mixtral). While any4 does not require preprocessing of weights or activations, it is also competitive with orthogonal techniques that do require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show competitiveness at lower bit widths. Additionally, we show that we can calibrate using a single curated, diverse sample rather than the hundreds of dataset samples used in most quantization approaches. We also open-source tinygemm, a latency-optimized GPU matrix multiplication library for LLMs that implements any4 using a GPU-efficient lookup table strategy, along with other common quantization methods. Our code is available at https://github.com/facebookresearch/any4.
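
To make the lookup-table idea concrete, here is a minimal sketch (our illustration, not the authors' implementation or the tinygemm API) of how 4-bit codes that index a small learned table of arbitrary real values can be dequantized, in contrast to fixed grids like int4, fp4, or nf4. The function name, tensor shapes, and per-row table/scale layout are all illustrative assumptions.

```python
# Hypothetical sketch of lookup-table-based 4-bit dequantization.
# Assumption: one learned 16-entry table and one scale per weight row;
# the actual any4/tinygemm layout may differ.
import torch

def dequantize_any4(codes: torch.Tensor, lut: torch.Tensor,
                    scales: torch.Tensor) -> torch.Tensor:
    """Reconstruct weights from 4-bit codes.

    codes:  (rows, cols) uint8 tensor with values in [0, 15]
    lut:    (rows, 16) learned lookup table; entries can be any real values
    scales: (rows, 1) per-row scale factors
    """
    # Each code selects its row's learned value from the table,
    # then the per-row scale restores the original dynamic range.
    values = torch.gather(lut, dim=1, index=codes.long())
    return values * scales

if __name__ == "__main__":
    rows, cols = 4, 8
    codes = torch.randint(0, 16, (rows, cols), dtype=torch.uint8)
    lut = torch.sort(torch.randn(rows, 16), dim=1).values  # any 16 values per row
    scales = torch.rand(rows, 1)
    print(dequantize_any4(codes, lut, scales).shape)  # torch.Size([4, 8])
```

Under this reading, "learned" means the 16 table entries themselves are fit to the model's weights (e.g., by clustering each row's scaled values, one plausible choice), which is what lets any4 represent value distributions that fixed formats cannot.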