SnapKV: LLM Knows What You are Looking for Before Generation

Abstract

Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the KV cache grows with input length, which poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, a fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Moreover, this robust pattern can be obtained from an 'observation' window located at the end of the prompt. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when processing long input sequences. Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to the baseline when processing inputs of 16K tokens. At the same time, it maintains performance comparable to baseline models across 16 long-sequence datasets. Moreover, SnapKV can process up to 380K context tokens on a single A100-80GB GPU using the HuggingFace implementation with minor changes, exhibiting only a negligible accuracy drop in the Needle-in-a-Haystack test. Further comprehensive studies suggest SnapKV's potential for practical applications.
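The selection step summarized in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `snapkv_select`, the parameters `window`, `budget`, and `kernel`, and the use of 1D max pooling as the "clustering" step are all assumptions made for illustration. The sketch also simplifies by normalizing attention over the prefix only and by ignoring batching and causal masking.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def max_pool_1d(scores, kernel):
    """Same-length 1D max pooling along the last axis (kernel must be odd)."""
    pad = kernel // 2
    padded = np.pad(scores, [(0, 0), (pad, pad)], constant_values=-np.inf)
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel, axis=-1)
    return windows.max(axis=-1)


def snapkv_select(q, k, v, window=4, budget=8, kernel=3):
    """Hypothetical sketch of per-head KV selection.

    q, k, v: arrays of shape (heads, seq_len, head_dim) for one prompt.
    Returns compressed (k, v) of shape (heads, budget + window, head_dim).
    """
    heads, seq_len, dim = k.shape
    prefix = seq_len - window
    if prefix <= budget:
        return k, v  # prompt already fits the budget; nothing to compress

    # Attention of the observation-window queries over the prefix keys.
    obs_q = q[:, -window:, :]
    logits = obs_q @ k[:, :prefix, :].transpose(0, 2, 1) / np.sqrt(dim)
    weights = softmax(logits, axis=-1)  # (heads, window, prefix)

    # Vote: total attention each prefix position receives, per head.
    votes = weights.sum(axis=1)  # (heads, prefix)
    # Assumed clustering step: pooling lets neighbours of a strong
    # position inherit its score, so contiguous spans tend to be kept.
    pooled = max_pool_1d(votes, kernel)

    # Top-`budget` prefix positions per head, restored to original order.
    top = np.sort(np.argsort(-pooled, axis=-1)[:, :budget], axis=-1)
    tail = np.broadcast_to(np.arange(prefix, seq_len), (heads, window))
    idx = np.concatenate([top, tail], axis=-1)[:, :, None]
    return np.take_along_axis(k, idx, axis=1), np.take_along_axis(v, idx, axis=1)


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 32, 16)) for _ in range(3))
k_small, v_small = snapkv_select(q, k, v)
print(k_small.shape)  # (2, 12, 16): 8 selected positions + 4 window positions
```

Each head keeps its own set of prefix positions, and the observation window itself is always retained, so the compressed cache has a fixed size regardless of prompt length.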
