Hydragen: High-Throughput LLM Inference with Shared Prefixes
February 7, 2024
Authors: Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini
cs.AI
Abstract
Transformer-based large language models (LLMs) are now deployed to hundreds
of millions of users. LLM inference is commonly performed on batches of
sequences that share a prefix, such as few-shot examples or a chatbot system
prompt. Decoding in this large-batch setting can be bottlenecked by the
attention operation, which reads large key-value (KV) caches from memory and
computes inefficient matrix-vector products for every sequence in the batch. In
this work, we introduce Hydragen, a hardware-aware exact implementation of
attention with shared prefixes. Hydragen computes attention over the shared
prefix and unique suffixes separately. This decomposition enables efficient
prefix attention by batching queries together across sequences, reducing
redundant memory reads and enabling the use of hardware-friendly matrix
multiplications. Our method can improve end-to-end LLM throughput by up to 32x
against competitive baselines, with speedup growing with the batch size and
shared prefix length. Hydragen also enables the use of very long shared
contexts: with a high batch size, increasing the prefix length from 1K to 16K
tokens decreases Hydragen throughput by less than 15%, while the throughput of
baselines drops by over 90%. Hydragen generalizes beyond simple prefix-suffix
decomposition and can be applied to tree-based prompt sharing patterns,
allowing us to further reduce inference time on competitive programming
problems by 55%.
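
To make the decomposition described above concrete, the sketch below shows one way to compute exact attention over a shared prefix plus per-sequence suffixes by folding the batch into the query axis for the prefix and recombining the two partial results with their log-sum-exp weights. This is a minimal illustration in plain PyTorch, not the authors' optimized kernel; the shapes, function names, and single-head setup are assumptions made for clarity.

```python
# Minimal sketch of prefix/suffix attention decomposition (assumed shapes,
# single attention head, no masking) -- not the paper's optimized kernel.
import torch

def attn_with_lse(q, k, v, scale):
    """Softmax attention plus the log-sum-exp of the attention scores.

    q: (b, n_q, d), k/v: (b, n_k, d). Returns out: (b, n_q, d), lse: (b, n_q).
    """
    scores = torch.einsum("bqd,bkd->bqk", q, k) * scale   # (b, n_q, n_k)
    lse = torch.logsumexp(scores, dim=-1)                 # (b, n_q)
    probs = torch.softmax(scores, dim=-1)
    out = torch.einsum("bqk,bkd->bqd", probs, v)
    return out, lse

def prefix_suffix_attention(q, prefix_k, prefix_v, suffix_k, suffix_v):
    """Exact attention over [shared prefix ; per-sequence suffix].

    q:          (batch, n_q, d)      queries for each sequence
    prefix_k/v: (prefix_len, d)      one shared copy for the whole batch
    suffix_k/v: (batch, suffix_len, d)
    """
    b, n_q, d = q.shape
    scale = d ** -0.5

    # Prefix attention: fold the batch into the query axis so every sequence
    # attends to the single shared prefix in one matrix-matrix multiply
    # (the inter-sequence query batching the abstract describes).
    q_flat = q.reshape(1, b * n_q, d)
    p_out, p_lse = attn_with_lse(q_flat, prefix_k[None], prefix_v[None], scale)
    p_out = p_out.reshape(b, n_q, d)
    p_lse = p_lse.reshape(b, n_q)

    # Suffix attention: ordinary per-sequence attention over the unique KV.
    s_out, s_lse = attn_with_lse(q, suffix_k, suffix_v, scale)

    # Combine the partial results with their log-sum-exp weights; this
    # recovers exact softmax attention over the full concatenated KV cache.
    m = torch.maximum(p_lse, s_lse)
    w_p = torch.exp(p_lse - m).unsqueeze(-1)
    w_s = torch.exp(s_lse - m).unsqueeze(-1)
    return (w_p * p_out + w_s * s_out) / (w_p + w_s)
```

A quick sanity check under these assumptions is to compare the output against ordinary attention over the concatenated keys and values (expanding the prefix across the batch with `torch.cat`); the two should agree up to floating-point error, since the decomposition only regroups the softmax normalization rather than approximating it.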