any4: Learned 4-bit Numeric Representation for LLMs
July 7, 2025
Authors: Mostafa Elhoushi, Jeff Johnson
cs.AI
Abstract
We present any4, a learned 4-bit weight quantization solution for large
language models (LLMs) providing arbitrary numeric representations without
requiring pre-processing of weights or activations. any4 yields higher accuracy
compared to other related 4-bit numeric representation types: int4, fp4 and
nf4, as evaluated on a range of model sizes, generations and families (Llama 2,
Llama 3, Mistral and Mixtral). While any4 does not require preprocessing of
weights or activations, it is also competitive with orthogonal techniques that
require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3
and any2 and show competitiveness at lower bits. Additionally, we show that we
can calibrate using a single curated diverse sample rather than hundreds of
samples from a dataset as done in most quantization approaches. We also
open-source tinygemm, a latency-optimized GPU matrix multiplication library for
LLMs that implements any4 using a GPU-efficient lookup-table strategy along
with other common quantization methods. We open-source our code at
https://github.com/facebookresearch/any4.
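
To make the lookup-table idea concrete, below is a minimal, hypothetical NumPy sketch of 4-bit LUT quantization: each weight group is scaled, every weight is snapped to the nearest of 16 representative values (stored as a 4-bit code), and dequantization is a table lookup plus rescaling. The evenly spaced table here is only a stand-in; any4 learns its representative values, and the function names, group size, and scaling scheme below are illustrative assumptions, not the authors' implementation or API.

    # Hypothetical sketch of lookup-table-based 4-bit weight quantization.
    # Not the any4 implementation; a fixed, evenly spaced 16-entry table is
    # used as a stand-in for learned representative values.
    import numpy as np

    def quantize_lut(weights, lut, group_size=128):
        """Map a 1-D float array to 4-bit codes against a 16-entry LUT."""
        assert lut.shape == (16,)
        w = weights.reshape(-1, group_size)                 # (groups, group_size)
        scales = np.abs(w).max(axis=1, keepdims=True)       # per-group scale (assumed scheme)
        scales = np.where(scales == 0, 1.0, scales) / np.abs(lut).max()
        # Index of the nearest LUT entry for each scaled weight.
        dist = np.abs(w[..., None] / scales[..., None] - lut)   # (groups, group_size, 16)
        codes = dist.argmin(axis=-1).astype(np.uint8)
        return codes, scales

    def dequantize_lut(codes, scales, lut):
        """Reconstruct approximate weights by table lookup and rescaling."""
        return (lut[codes] * scales).reshape(-1)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        w = rng.standard_normal(1024).astype(np.float32)
        lut = np.linspace(-1.0, 1.0, 16, dtype=np.float32)  # stand-in for learned values
        codes, scales = quantize_lut(w, lut)
        w_hat = dequantize_lut(codes, scales, lut)
        print("mean abs reconstruction error:", np.abs(w - w_hat).mean())

At inference time, a kernel such as those in tinygemm would keep the 4-bit codes and the small table in fast memory and perform the lookup and rescaling on the fly during the matrix multiplication; the sketch above only illustrates the numeric representation itself.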