LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

Abstract

We introduce LogQuant, a 2-bit quantization technique for the KV Cache in large language model (LLM) inference that delivers substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens from earlier attention patterns. Both approaches, however, can lead to performance bottlenecks or frequent mispredictions. LogQuant takes a different approach: by applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even a smaller memory footprint than existing methods. In benchmark tests, it improves throughput by 25% and increases batch size by 60% without increasing memory consumption. On challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates with popular inference frameworks such as Python's transformers library. The implementation is available at https://github.com/Concyclics/LogQuantKV.
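The abstract does not spell out the filtering rule or the quantizer, so the following is only an illustrative sketch, not the LogQuant implementation. It assumes a hypothetical filter that keeps tokens in full precision at power-of-two distances from the newest token (dense near the end, sparse toward the start) and a plain min-max 2-bit quantizer for everything else. The function names and the base parameter are invented for this example.

```python
def log_spaced_positions(seq_len, base=2):
    """Hypothetical log-based filter (an assumption, not the paper's rule).

    Returns the positions whose distance from the end of the sequence
    is a power of `base`, so recent tokens are sampled densely and
    older tokens sparsely.
    """
    keep = set()
    distance = 1
    while distance <= seq_len:
        keep.add(seq_len - distance)
        distance *= base
    return sorted(keep)


def quantize_2bit(values):
    """Min-max quantization of a list of floats to codes in {0, 1, 2, 3}."""
    lo, hi = min(values), max(values)
    # Three steps span the range; fall back to 1.0 when all values are equal.
    scale = (hi - lo) / 3 or 1.0
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale


def dequantize_2bit(codes, lo, scale):
    """Map 2-bit codes back to approximate float values."""
    return [lo + c * scale for c in codes]


def compress(values):
    """Keep log-spaced positions in full precision, quantize the rest."""
    keep = set(log_spaced_positions(len(values)))
    full = {i: values[i] for i in keep}
    rest = [v for i, v in enumerate(values) if i not in keep]
    codes, lo, scale = quantize_2bit(rest) if rest else ([], 0.0, 1.0)
    return full, codes, lo, scale
```

In this sketch a 16-token sequence keeps 5 positions in full precision and stores the other 11 as 2-bit codes. The reconstruction error of each quantized value is bounded by half the quantization step.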
