H_2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Abstract

Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to the model parameters, and it scales linearly with sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H_2). Through a comprehensive investigation, we find that (i) the emergence of H_2 is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and (ii) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle (H_2O), a KV cache eviction policy that dynamically retains a balance of recent and H_2 tokens. We formulate KV cache eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm, which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of H_2O with 20% heavy hitters improves throughput over three leading inference systems, DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen, by up to 29×, 29×, and 3× respectively on OPT-6.7B and OPT-30B. With the same batch size, H_2O can reduce latency by up to 1.9×. The code is available at https://github.com/FMInference/H2O.
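
The eviction policy described above (keep the most recent tokens plus the tokens with the highest accumulated attention) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `h2o_keep_set`, the parameters `budget` and `recent`, and the use of precomputed attention rows are all assumptions made for illustration. A real decoder would compute attention only over the tokens still in the cache, which this sketch simplifies away.

```python
from typing import Dict, List, Sequence


def h2o_keep_set(attn: Sequence[Sequence[float]], budget: int, recent: int) -> List[int]:
    """Greedy heavy-hitter eviction over a causal attention matrix (illustrative only).

    attn[t][j] is the attention weight of query position t on key position j (j <= t).
    budget is the maximum number of cached positions; the last `recent` positions
    are never evicted. Both names are hypothetical, not taken from the paper's code.
    """
    if not 1 <= recent <= budget:
        raise ValueError("require 1 <= recent <= budget")

    cache: List[int] = []        # cached key positions, in increasing order
    acc: Dict[int, float] = {}   # accumulated attention received by each cached key

    for t, row in enumerate(attn):
        cache.append(t)
        acc[t] = 0.0
        # Accumulate the attention that the current query assigns to each cached key.
        for j in cache:
            acc[j] += row[j]
        if len(cache) > budget:
            # Only positions outside the recent window are eviction candidates.
            candidates = cache[:-recent]
            victim = min(candidates, key=lambda j: acc[j])
            cache.remove(victim)
            del acc[victim]
    return cache


# Toy example: position 0 receives most of the attention at every step.
toy_attn = [
    [1.0],
    [0.9, 0.1],
    [0.8, 0.1, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.1, 0.1, 0.1, 0.1],
]
kept = h2o_keep_set(toy_attn, budget=3, recent=2)
print(kept)  # [0, 3, 4]
```

In this toy run the cache retains position 0 (the heavy hitter) alongside the two most recent positions, while the low-scoring middle positions are evicted one at a time as the budget is exceeded.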
