LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
March 25, 2025
Authors: Han Chen, Zicong Jiang, Zining Zhang, Bingsheng He, Pingyi Luo, Mian Lu, Yuqiang Chen
cs.AI
Abstract
We introduce LogQuant, a groundbreaking 2-bit quantization technique for KV
Cache in large language model (LLM) inference, delivering substantial memory
savings while preserving superior performance. Previous methods either assume
that later tokens are more important or attempt to predict important tokens
based on earlier attention patterns. Both approaches, however, can result in
performance bottlenecks or frequent mispredictions.
LogQuant takes a different approach. By applying a log-based filtering
mechanism, it selectively compresses the KV Cache across the entire context,
achieving better performance with the same or even reduced memory footprint
compared to existing methods. In benchmark tests, it enhances throughput by 25%
and boosts batch size by 60% without increasing memory consumption. For
challenging tasks such as Math and Code Completion, LogQuant improves accuracy
by 40% to 200% at the same compression ratio, outperforming comparable
techniques. LogQuant integrates effortlessly with popular inference frameworks
like Python's transformers library. The implementation is available at
https://github.com/Concyclics/LogQuantKV.
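
The abstract attributes LogQuant's gains to a log-based filtering mechanism that decides which parts of the KV cache are kept at high precision while the remainder is compressed to 2 bits, but it does not spell out the algorithm. The snippet below is only a minimal illustrative sketch of that general idea, assuming log-spaced position selection combined with uniform min/max 2-bit quantization; the helper names (log_spaced_keep_indices, quantize_2bit) are hypothetical and are not the LogQuantKV API or the authors' exact method.

```python
import math
import torch


def log_spaced_keep_indices(seq_len: int, num_keep: int) -> torch.Tensor:
    """Pick token positions to keep at full precision.

    Positions are sampled on a log scale of distance from the newest token,
    so recent tokens are kept densely and older tokens more sparsely.
    (Illustrative assumption, not the paper's exact selection rule.)
    """
    distances = torch.logspace(0.0, math.log10(seq_len), steps=num_keep)
    positions = (seq_len - distances.round().long()).clamp(0, seq_len - 1)
    return torch.unique(positions)


def quantize_2bit(x: torch.Tensor):
    """Uniform 2-bit (4-level) min/max quantization of a tensor."""
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / 3  # 2 bits -> integer levels 0..3
    q = ((x - x_min) / scale).round().clamp(0, 3).to(torch.uint8)
    return q, scale, x_min


def dequantize_2bit(q: torch.Tensor, scale: torch.Tensor, x_min: torch.Tensor) -> torch.Tensor:
    """Map 2-bit integer codes back to approximate float values."""
    return q.to(torch.float32) * scale + x_min


# Toy example: one attention head's key cache of shape (seq_len, head_dim).
seq_len, head_dim = 128, 64
k_cache = torch.randn(seq_len, head_dim)

keep = log_spaced_keep_indices(seq_len, num_keep=32)
mask = torch.zeros(seq_len, dtype=torch.bool)
mask[keep] = True

# Kept positions stay in full precision; all other positions are stored in 2 bits.
q, scale, zero_point = quantize_2bit(k_cache[~mask])
k_reconstructed = k_cache.clone()
k_reconstructed[~mask] = dequantize_2bit(q, scale, zero_point)
```

In a real inference pipeline this split would be maintained per layer and per head as new tokens are appended, with attention reading from the mix of full-precision and dequantized entries; the sketch only shows the selection and 2-bit compression step in isolation.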