H_2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Abstract

Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to the model parameters, and it scales linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H_2). Through a comprehensive investigation, we find that (i) the emergence of H_2 is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and (ii) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle (H_2O), a KV cache eviction policy that dynamically retains a balance of recent and H_2 tokens. We formulate KV cache eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm, which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of H_2O with 20% heavy hitters improves the throughput over three leading inference systems, DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen, by up to 29×, 29×, and 3×, respectively, on OPT-6.7B and OPT-30B. With the same batch size, H_2O can reduce the latency by up to 1.9×. The code is available at https://github.com/FMInference/H2O.
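As a rough illustration of the eviction policy described in the abstract, the sketch below keeps a fixed budget of recent tokens plus a fixed budget of heavy hitters among the older tokens. The abstract does not specify how heavy hitters are scored, so this sketch assumes each cached token carries an accumulated attention score; the function name `h2o_keep_indices` and its parameters are hypothetical and are not taken from the authors' code.

```python
from typing import List, Sequence


def h2o_keep_indices(
    accumulated_scores: Sequence[float],
    heavy_budget: int,
    recent_budget: int,
) -> List[int]:
    """Return the cached token positions to keep, in ascending order.

    Assumption (not stated in the abstract): accumulated_scores[i] is the
    attention mass token i has received so far, summed over decoding steps.
    """
    if heavy_budget < 0 or recent_budget < 0:
        raise ValueError("budgets must be non-negative")

    n = len(accumulated_scores)
    # Nothing to evict while the cache still fits in the combined budget.
    if n <= heavy_budget + recent_budget:
        return list(range(n))

    # The most recent tokens are always retained.
    recent_start = n - recent_budget
    recent = list(range(recent_start, n))

    # Among the older tokens, keep those with the highest accumulated score.
    older = range(recent_start)
    heavy = sorted(older, key=lambda i: accumulated_scores[i], reverse=True)
    heavy = sorted(heavy[:heavy_budget])

    return heavy + recent


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.1, 0.3, 0.4]
    # Keep 2 heavy hitters and the 3 most recent of 8 cached tokens.
    print(h2o_keep_indices(scores, heavy_budget=2, recent_budget=3))
    # prints [0, 3, 5, 6, 7]
```

In this toy run, positions 0 and 3 survive because they hold the largest accumulated scores among the older tokens, while positions 5 to 7 survive because they are recent. A real implementation would apply a selection like this per attention head at each decoding step and drop the corresponding key and value entries.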
