Matryoshka Quantization
February 10, 2025
Authors: Pranav Nair, Puranjay Datta, Jeff Dean, Prateek Jain, Aditya Kusupati
cs.AI
Abstract
Quantizing model weights is critical for reducing the communication and
inference costs of large models. However, quantizing models -- especially to
low precisions like int4 or int2 -- requires a trade-off in model quality;
int2, in particular, is known to severely degrade model quality. Consequently,
practitioners are often forced to maintain multiple models with different
quantization levels or serve a single model that best satisfies the
quality-latency trade-off. On the other hand, integer data types, such as int8,
inherently possess a nested (Matryoshka) structure where smaller bit-width
integers, like int4 or int2, are nested within the most significant bits. This
paper proposes Matryoshka Quantization (MatQuant), a novel multi-scale
quantization technique that addresses the challenge of needing multiple
quantized models. It allows training and maintaining just one model, which can
then be served at different precision levels. Furthermore, due to the
co-training and co-distillation regularization provided by MatQuant, the int2
precision models extracted by MatQuant can be up to 10% more accurate than
standard int2 quantization (using techniques like QAT or OmniQuant). This
represents significant progress in model quantization, demonstrated by the fact
that, with the same recipe, an int2 FFN-quantized Gemma-2 9B model is more
accurate than an int8 FFN-quantized Gemma-2 2B model.
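To make the nested (Matryoshka) structure mentioned in the abstract concrete, here is a minimal sketch in NumPy. It is our illustration, not the paper's code: it shows that slicing the most significant bits of an unsigned int8 quantization code yields a valid int4 or int2 code for the same weight, so a single int8 model implicitly contains its lower-precision versions. The min-max quantizer and the rescaling in `dequantize` are simplifying assumptions.

```python
import numpy as np

def quantize_uint8(w: np.ndarray):
    """Simple min-max quantization of weights to unsigned 8-bit codes."""
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / 255.0
    q8 = np.round((w - w_min) / scale).astype(np.uint8)
    return q8, scale, w_min

def slice_msb(q8: np.ndarray, bits: int) -> np.ndarray:
    """Keep only the `bits` most significant bits of each 8-bit code."""
    return (q8 >> (8 - bits)).astype(np.uint8)

def dequantize(q: np.ndarray, bits: int, scale: float, w_min: float) -> np.ndarray:
    """Map a `bits`-wide code back to real values; the int8 scale is
    stretched by 2**(8 - bits) because the sliced code spans fewer levels."""
    return q.astype(np.float32) * scale * (2 ** (8 - bits)) + w_min

w = np.random.randn(6).astype(np.float32)
q8, scale, w_min = quantize_uint8(w)
for bits in (8, 4, 2):
    q = slice_msb(q8, bits)
    w_hat = dequantize(q, bits, scale, w_min)
    print(f"int{bits}: codes={q}, max error={np.abs(w - w_hat).max():.3f}")
```

Running this shows the reconstruction error growing as fewer MSBs are kept, which is exactly why a naively sliced int2 model degrades and why MatQuant trains all precisions jointly rather than slicing after the fact.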
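The abstract also attributes the int2 accuracy gains to "co-training and co-distillation regularization". A plausible reading is a joint objective where the same underlying int8 codes are evaluated at every nested precision and a combined loss is minimized. The sketch below assumes uniform loss weights and a plain task loss in place of distillation from a teacher (both our assumptions), and reuses `quantize_uint8`, `slice_msb`, and `dequantize` from the previous sketch.

```python
import numpy as np

BIT_WIDTHS = (8, 4, 2)                     # nested precisions served by one model
LOSS_WEIGHTS = {8: 1.0, 4: 1.0, 2: 1.0}    # assumed uniform weighting

def task_loss(w_hat: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    """Toy squared-error loss for a single linear layer, y ~ x @ w."""
    return float(np.mean((x @ w_hat - y) ** 2))

def multi_scale_loss(w: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    """Sum of task losses over all nested precisions of one set of weights,
    so optimizing it regularizes the int2 slice via the higher precisions."""
    q8, scale, w_min = quantize_uint8(w)  # helpers from the sketch above
    return sum(
        LOSS_WEIGHTS[b]
        * task_loss(dequantize(slice_msb(q8, b), b, scale, w_min), x, y)
        for b in BIT_WIDTHS
    )
```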