Hydragen: High-Throughput LLM Inference with Shared Prefixes
February 7, 2024
Authors: Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini
cs.AI
Abstract
Transformer-based large language models (LLMs) are now deployed to hundreds
of millions of users. LLM inference is commonly performed on batches of
sequences that share a prefix, such as few-shot examples or a chatbot system
prompt. Decoding in this large-batch setting can be bottlenecked by the
attention operation, which reads large key-value (KV) caches from memory and
computes inefficient matrix-vector products for every sequence in the batch. In
this work, we introduce Hydragen, a hardware-aware exact implementation of
attention with shared prefixes. Hydragen computes attention over the shared
prefix and unique suffixes separately. This decomposition enables efficient
prefix attention by batching queries together across sequences, reducing
redundant memory reads and enabling the use of hardware-friendly matrix
multiplications. Our method can improve end-to-end LLM throughput by up to 32x
against competitive baselines, with speedup growing with the batch size and
shared prefix length. Hydragen also enables the use of very long shared
contexts: with a high batch size, increasing the prefix length from 1K to 16K
tokens decreases Hydragen throughput by less than 15%, while the throughput of
baselines drops by over 90%. Hydragen generalizes beyond simple prefix-suffix
decomposition and can be applied to tree-based prompt sharing patterns,
allowing us to further reduce inference time on competitive programming
problems by 55%.
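
To make the decomposition described above concrete, the sketch below shows one way to compute exact attention over a shared prefix plus per-sequence suffixes by folding the batch into the query axis for the prefix and recombining the two partial results with their log-sum-exp weights. This is a minimal illustration in plain PyTorch, not the authors' optimized kernel; the shapes, function names, and single-head setup are assumptions made for clarity.

```python
# Minimal sketch of prefix/suffix attention decomposition (assumed shapes,
# single attention head, no masking) -- not the paper's optimized kernel.
import torch

def attn_with_lse(q, k, v, scale):
    """Softmax attention plus the log-sum-exp of the attention scores.

    q: (b, n_q, d), k/v: (b, n_k, d). Returns out: (b, n_q, d), lse: (b, n_q).
    """
    scores = torch.einsum("bqd,bkd->bqk", q, k) * scale   # (b, n_q, n_k)
    lse = torch.logsumexp(scores, dim=-1)                 # (b, n_q)
    probs = torch.softmax(scores, dim=-1)
    out = torch.einsum("bqk,bkd->bqd", probs, v)
    return out, lse

def prefix_suffix_attention(q, prefix_k, prefix_v, suffix_k, suffix_v):
    """Exact attention over [shared prefix ; per-sequence suffix].

    q:          (batch, n_q, d)      queries for each sequence
    prefix_k/v: (prefix_len, d)      one shared copy for the whole batch
    suffix_k/v: (batch, suffix_len, d)
    """
    b, n_q, d = q.shape
    scale = d ** -0.5

    # Prefix attention: fold the batch into the query axis so every sequence
    # attends to the single shared prefix in one matrix-matrix multiply
    # (the inter-sequence query batching the abstract describes).
    q_flat = q.reshape(1, b * n_q, d)
    p_out, p_lse = attn_with_lse(q_flat, prefix_k[None], prefix_v[None], scale)
    p_out = p_out.reshape(b, n_q, d)
    p_lse = p_lse.reshape(b, n_q)

    # Suffix attention: ordinary per-sequence attention over the unique KV.
    s_out, s_lse = attn_with_lse(q, suffix_k, suffix_v, scale)

    # Combine the partial results with their log-sum-exp weights; this
    # recovers exact softmax attention over the full concatenated KV cache.
    m = torch.maximum(p_lse, s_lse)
    w_p = torch.exp(p_lse - m).unsqueeze(-1)
    w_s = torch.exp(s_lse - m).unsqueeze(-1)
    return (w_p * p_out + w_s * s_out) / (w_p + w_s)
```

A quick sanity check under these assumptions is to compare the output against ordinary attention over the concatenated keys and values (expanding the prefix across the batch with `torch.cat`); the two should agree up to floating-point error, since the decomposition only regroups the softmax normalization rather than approximating it.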