any4: Learned 4-bit Numeric Representation for LLMs
July 7, 2025
Authors: Mostafa Elhoushi, Jeff Johnson
cs.AI
Abstract
We present any4, a learned 4-bit weight quantization solution for large
language models (LLMs) providing arbitrary numeric representations without
requiring pre-processing of weights or activations. any4 yields higher accuracy
compared to other related 4-bit numeric representation types: int4, fp4 and
nf4, as evaluated on a range of model sizes, generations and families (Llama 2,
Llama 3, Mistral and Mixtral). While any4 does not require preprocessing of
weights or activations, it is also competitive with orthogonal techniques that
require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3
and any2 and show competitiveness at lower bits. Additionally, we show that we
can calibrate using a single curated diverse sample rather than hundreds of
samples from a dataset as done in most quantization approaches. We also
open-source tinygemm, a latency-optimized GPU matrix multiplication library for
LLMs that implements any4 using a GPU-efficient lookup-table strategy along
with other common quantization methods. We open-source our code at
https://github.com/facebookresearch/any4.
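
To make the lookup-table idea concrete, below is a minimal, hypothetical NumPy sketch of 4-bit LUT quantization: each weight group is scaled, every weight is snapped to the nearest of 16 representative values (stored as a 4-bit code), and dequantization is a table lookup plus rescaling. The evenly spaced table here is only a stand-in; any4 learns its representative values, and the function names, group size, and scaling scheme below are illustrative assumptions, not the authors' implementation or API.

    # Hypothetical sketch of lookup-table-based 4-bit weight quantization.
    # Not the any4 implementation; a fixed, evenly spaced 16-entry table is
    # used as a stand-in for learned representative values.
    import numpy as np

    def quantize_lut(weights, lut, group_size=128):
        """Map a 1-D float array to 4-bit codes against a 16-entry LUT."""
        assert lut.shape == (16,)
        w = weights.reshape(-1, group_size)                 # (groups, group_size)
        scales = np.abs(w).max(axis=1, keepdims=True)       # per-group scale (assumed scheme)
        scales = np.where(scales == 0, 1.0, scales) / np.abs(lut).max()
        # Index of the nearest LUT entry for each scaled weight.
        dist = np.abs(w[..., None] / scales[..., None] - lut)   # (groups, group_size, 16)
        codes = dist.argmin(axis=-1).astype(np.uint8)
        return codes, scales

    def dequantize_lut(codes, scales, lut):
        """Reconstruct approximate weights by table lookup and rescaling."""
        return (lut[codes] * scales).reshape(-1)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        w = rng.standard_normal(1024).astype(np.float32)
        lut = np.linspace(-1.0, 1.0, 16, dtype=np.float32)  # stand-in for learned values
        codes, scales = quantize_lut(w, lut)
        w_hat = dequantize_lut(codes, scales, lut)
        print("mean abs reconstruction error:", np.abs(w - w_hat).mean())

At inference time, a kernel such as those in tinygemm would keep the 4-bit codes and the small table in fast memory and perform the lookup and rescaling on the fly during the matrix multiplication; the sketch above only illustrates the numeric representation itself.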