LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

Abstract

We introduce LogQuant, a 2-bit quantization technique for the KV cache in large language model (LLM) inference that delivers substantial memory savings while preserving accuracy. Previous methods either assume that later tokens are more important or attempt to predict important tokens from earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach: by applying a log-based filtering mechanism, it selectively compresses the KV cache across the entire context, achieving better performance than existing methods with the same or even a smaller memory footprint. In benchmark tests, it improves throughput by 25% and increases batch size by 60% without increasing memory consumption. On challenging tasks such as math and code completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates with popular inference frameworks such as Python's transformers library. The implementation is available at https://github.com/Concyclics/LogQuantKV.
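To make the storage side of "2-bit quantization" concrete, the following is a minimal sketch of generic asymmetric min-max quantization to four levels, with four codes packed into each byte. It is not LogQuant's algorithm: the abstract does not specify the quantizer or the log-based filtering rule, so the function names `quantize_2bit` and `dequantize_2bit`, the per-group min-max scaling, and the packing layout are all assumptions made for illustration.

```python
from typing import List, Tuple

LEVELS = 4  # 2 bits give 4 representable codes (0..3)


def quantize_2bit(values: List[float]) -> Tuple[bytes, float, float, int]:
    """Hypothetical helper: quantize floats to 2-bit codes, four codes per byte.

    Returns (packed_bytes, scale, zero_point, count).
    """
    lo, hi = min(values), max(values)
    # Step between adjacent levels; fall back to 1.0 for constant input.
    scale = (hi - lo) / (LEVELS - 1) or 1.0
    codes = [min(LEVELS - 1, max(0, round((v - lo) / scale))) for v in values]
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        packed[i // 4] |= code << (2 * (i % 4))  # little-endian within a byte
    return bytes(packed), scale, lo, len(values)


def dequantize_2bit(packed: bytes, scale: float, zero: float, count: int) -> List[float]:
    """Hypothetical helper: unpack 2-bit codes and map them back to floats."""
    return [((packed[i // 4] >> (2 * (i % 4))) & 0b11) * scale + zero
            for i in range(count)]


if __name__ == "__main__":
    group = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.2]
    packed, scale, zero, n = quantize_2bit(group)
    print(len(packed), "bytes for", n, "values")
    print(dequantize_2bit(packed, scale, zero, n))
```

In this sketch, six values occupy 2 bytes instead of the 12 bytes they would need in 16-bit floating point, an 8x reduction before counting the per-group scale and zero point. The reconstruction error of each value is at most half a quantization step, which is why the choice of which tokens to keep at higher precision matters for accuracy at this bit width.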
