
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

March 25, 2025
Authors: Han Chen, Zicong Jiang, Zining Zhang, Bingsheng He, Pingyi Luo, Mian Lu, Yuqiang Chen
cs.AI

Abstract

We introduce LogQuant, a groundbreaking 2-bit quantization technique for KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach. By applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates effortlessly with popular inference frameworks like Python's transformers library. The implementation is available at https://github.com/Concyclics/LogQuantKV.
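To make the filtering-plus-quantization idea concrete, the following is a minimal sketch in PyTorch: it keeps a log-spaced subset of token positions at full precision and applies uniform 2-bit quantization to the remaining entries of a toy key cache. The selection rule, function names, and per-tensor scaling here are illustrative assumptions, not the paper's reference implementation (see the GitHub repository for that).

```python
import math
import torch

def log_spaced_keep_indices(seq_len: int, num_keep: int) -> torch.Tensor:
    """Pick token positions to keep in full precision.

    Distances back from the newest token are sampled on a log scale, so
    recent tokens are kept densely and distant tokens sparsely.
    (Illustrative rule only; the paper's exact filter may differ.)
    """
    dists = torch.logspace(0.0, math.log10(seq_len), steps=num_keep)
    positions = (seq_len - dists.round().long()).clamp(0, seq_len - 1)
    return positions.unique()

def quantize_2bit(x: torch.Tensor):
    """Uniform per-tensor 2-bit quantization: 4 levels between min and max."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / 3  # 2 bits -> 4 levels -> 3 intervals
    q = ((x - lo) / scale).round().clamp(0, 3).to(torch.uint8)
    return q, scale, lo

def dequantize_2bit(q: torch.Tensor, scale: torch.Tensor, lo: torch.Tensor):
    """Map 2-bit codes back to approximate floating-point values."""
    return q.float() * scale + lo

# Example: compress a toy key cache of shape [seq_len, head_dim].
seq_len, head_dim = 1024, 64
keys = torch.randn(seq_len, head_dim)

keep = log_spaced_keep_indices(seq_len, num_keep=64)
mask = torch.zeros(seq_len, dtype=torch.bool)
mask[keep] = True

# Tokens outside the log-spaced set are stored in 2 bits; the rest stay full precision.
q, scale, lo = quantize_2bit(keys[~mask])
recovered = keys.clone()
recovered[~mask] = dequantize_2bit(q, scale, lo)
```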
