

H_2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

June 24, 2023
Authors: Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen
cs.AI

Abstract

Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H_2). Through a comprehensive investigation, we find that (i) the emergence of H_2 is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and (ii) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle (H_2O), a KV cache eviction policy that dynamically retains a balance of recent and H_2 tokens. We formulate the KV cache eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of H_2O with 20% heavy hitters improves the throughput over three leading inference systems, DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen, by up to 29×, 29×, and 3× on OPT-6.7B and OPT-30B. With the same batch size, H_2O can reduce the latency by up to 1.9×. The code is available at https://github.com/FMInference/H2O.
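
To make the eviction policy concrete, below is a minimal Python sketch of the idea described in the abstract; it is not the authors' implementation (see the GitHub repository linked above). The class name `HeavyHitterKVCache`, the `budget` and `recent_ratio` parameters, and the greedy lowest-score eviction rule are illustrative assumptions: the cache keeps a fixed number of entries, always protects the most recent tokens, and otherwise retains the tokens that have accumulated the most attention mass.

```python
class HeavyHitterKVCache:
    """Toy KV cache with an H_2O-style eviction rule (illustrative sketch).

    At most `budget` entries are kept: the newest `recent_budget` tokens are
    always protected, and the remaining slots go to "heavy hitters" -- tokens
    that have received the largest accumulated attention mass so far.
    """

    def __init__(self, budget: int = 64, recent_ratio: float = 0.5):
        self.budget = budget
        self.recent_budget = max(1, int(budget * recent_ratio))
        self.keys = []        # cached key vectors, one per retained token
        self.values = []      # cached value vectors, aligned with self.keys
        self.acc_scores = []  # running attention mass received by each token

    def step(self, key, value, attn_scores):
        """Add one decoding step's K/V and evict if over budget.

        `attn_scores` is the current query's attention distribution over the
        tokens currently cached, so len(attn_scores) == len(self.keys).
        """
        # Accumulate the attention mass each cached token just received.
        for i, score in enumerate(attn_scores):
            self.acc_scores[i] += float(score)

        # Append the new token; its own accumulated score starts at zero.
        self.keys.append(key)
        self.values.append(value)
        self.acc_scores.append(0.0)

        # Greedy eviction: among the non-recent tokens, drop the one with
        # the smallest accumulated attention score.
        if len(self.keys) > self.budget:
            evictable = range(len(self.keys) - self.recent_budget)
            victim = min(evictable, key=lambda i: self.acc_scores[i])
            for buf in (self.keys, self.values, self.acc_scores):
                del buf[victim]


# Usage sketch with dummy uniform attention weights.
cache = HeavyHitterKVCache(budget=4, recent_ratio=0.5)
for t in range(10):
    scores = [1.0 / max(1, len(cache.keys))] * len(cache.keys)
    cache.step(key=f"k{t}", value=f"v{t}", attn_scores=scores)
print(len(cache.keys))  # never exceeds the budget (prints 4)
```

The greedy rule here, evicting the cached token with the smallest accumulated attention mass while protecting recent tokens, is the kind of local decision that the paper's dynamic submodular formulation is intended to justify.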