Matryoshka Quantization

Abstract

Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models, especially to low precisions such as int4 or int2, comes at a cost in model quality; int2 in particular is known to degrade quality severely. Consequently, practitioners are often forced either to maintain multiple models at different quantization levels or to serve a single model that best satisfies the quality-latency trade-off. Integer data types such as int8, however, inherently possess a nested (Matryoshka) structure: smaller bit-width integers, such as int4 or int2, are nested within the most significant bits. This paper proposes Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that addresses the challenge of needing multiple quantized models. It allows a single model to be trained and maintained, and then served at different precision levels. Furthermore, thanks to the co-training and co-distillation regularization that MatQuant provides, the int2 models it extracts can be up to 10% more accurate than standard int2 quantization with techniques such as QAT or OmniQuant. This represents significant progress in model quantization: with the same recipe, an int2 FFN-quantized Gemma-2 9B model is more accurate than an int8 FFN-quantized Gemma-2 2B model.
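
The nested structure mentioned above can be illustrated with a minimal sketch. It assumes unsigned 8-bit codes and plain truncation by a right shift; the function name `slice_msb` and the example values are hypothetical, and the actual MatQuant procedure (training, any rounding or scaling of the sliced values) is not described in this abstract and may differ.

```python
import numpy as np

def slice_msb(codes: np.ndarray, bits: int) -> np.ndarray:
    """Return the `bits` most significant bits of each unsigned 8-bit code.

    Illustrative assumption: plain truncation, not the paper's exact operator.
    """
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    return codes.astype(np.uint8) >> (8 - bits)

# Hypothetical 8-bit weight codes, written in binary to make the nesting visible.
codes8 = np.array([0b10110110, 0b00011111, 0b11000000], dtype=np.uint8)

codes4 = slice_msb(codes8, 4)  # top 4 bits: 0b1011, 0b0001, 0b1100
codes2 = slice_msb(codes8, 2)  # top 2 bits: 0b10,   0b00,   0b11

# Nesting: the int2 code is also the top 2 bits of the int4 code.
nested_ok = bool(np.array_equal(codes4 >> 2, codes2))
```

Under these assumptions, one stored int8 tensor yields int4 and int2 tensors without separate copies, which is the property that lets a single model be served at several precisions.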