Matryoshka Quantization
February 10, 2025
Authors: Pranav Nair, Puranjay Datta, Jeff Dean, Prateek Jain, Aditya Kusupati
cs.AI
Abstract
Quantizing model weights is critical for reducing the communication and
inference costs of large models. However, quantizing models -- especially to
low precisions like int4 or int2 -- requires a trade-off in model quality;
int2, in particular, is known to severely degrade model quality. Consequently,
practitioners are often forced to maintain multiple models with different
quantization levels or serve a single model that best satisfies the
quality-latency trade-off. On the other hand, integer data types, such as int8,
inherently possess a nested (Matryoshka) structure where smaller bit-width
integers, like int4 or int2, are nested within the most significant bits. This
paper proposes Matryoshka Quantization (MatQuant), a novel multi-scale
quantization technique that addresses the challenge of needing multiple
quantized models. It allows training and maintaining just one model, which can
then be served at different precision levels. Furthermore, due to the
co-training and co-distillation regularization provided by MatQuant, the int2
precision models extracted by MatQuant can be up to 10% more accurate than
standard int2 quantization (using techniques like QAT or OmniQuant). This
represents significant progress in model quantization, demonstrated by the fact
that, with the same recipe, an int2 FFN-quantized Gemma-2 9B model is more
accurate than an int8 FFN-quantized Gemma-2 2B model.
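To make the nested (Matryoshka) structure mentioned in the abstract concrete, here is a minimal sketch in NumPy. It is our illustration, not the paper's code: it shows that slicing the most significant bits of an unsigned int8 quantization code yields a valid int4 or int2 code for the same weight, so a single int8 model implicitly contains its lower-precision versions. The min-max quantizer and the rescaling in `dequantize` are simplifying assumptions.

```python
import numpy as np

def quantize_uint8(w: np.ndarray):
    """Simple min-max quantization of weights to unsigned 8-bit codes."""
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / 255.0
    q8 = np.round((w - w_min) / scale).astype(np.uint8)
    return q8, scale, w_min

def slice_msb(q8: np.ndarray, bits: int) -> np.ndarray:
    """Keep only the `bits` most significant bits of each 8-bit code."""
    return (q8 >> (8 - bits)).astype(np.uint8)

def dequantize(q: np.ndarray, bits: int, scale: float, w_min: float) -> np.ndarray:
    """Map a `bits`-wide code back to real values; the int8 scale is
    stretched by 2**(8 - bits) because the sliced code spans fewer levels."""
    return q.astype(np.float32) * scale * (2 ** (8 - bits)) + w_min

w = np.random.randn(6).astype(np.float32)
q8, scale, w_min = quantize_uint8(w)
for bits in (8, 4, 2):
    q = slice_msb(q8, bits)
    w_hat = dequantize(q, bits, scale, w_min)
    print(f"int{bits}: codes={q}, max error={np.abs(w - w_hat).max():.3f}")
```

Running this shows the reconstruction error growing as fewer MSBs are kept, which is exactly why a naively sliced int2 model degrades and why MatQuant trains all precisions jointly rather than slicing after the fact.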
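The abstract also attributes the int2 accuracy gains to "co-training and co-distillation regularization". A plausible reading is a joint objective where the same underlying int8 codes are evaluated at every nested precision and a combined loss is minimized. The sketch below assumes uniform loss weights and a plain task loss in place of distillation from a teacher (both our assumptions), and reuses `quantize_uint8`, `slice_msb`, and `dequantize` from the previous sketch.

```python
import numpy as np

BIT_WIDTHS = (8, 4, 2)                     # nested precisions served by one model
LOSS_WEIGHTS = {8: 1.0, 4: 1.0, 2: 1.0}    # assumed uniform weighting

def task_loss(w_hat: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    """Toy squared-error loss for a single linear layer, y ~ x @ w."""
    return float(np.mean((x @ w_hat - y) ** 2))

def multi_scale_loss(w: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    """Sum of task losses over all nested precisions of one set of weights,
    so optimizing it regularizes the int2 slice via the higher precisions."""
    q8, scale, w_min = quantize_uint8(w)  # helpers from the sketch above
    return sum(
        LOSS_WEIGHTS[b]
        * task_loss(dequantize(slice_msb(q8, b), b, scale, w_min), x, y)
        for b in BIT_WIDTHS
    )
```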