MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
August 20, 2024
Authors: Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Beidi Chen
cs.AI
Abstract
Large Language Models (LLMs) have become more prevalent in long-context
applications such as interactive chatbots, document analysis, and agent
workflows, but it is challenging to serve long-context requests with low
latency and high throughput. Speculative decoding (SD) is a widely used
technique for reducing latency without sacrificing performance, but
conventional wisdom holds that its efficacy is limited to small batch sizes.
In MagicDec, we show that, surprisingly, SD can achieve speedup even in a
high-throughput inference regime for moderate to long sequences. More
interestingly, our rigorous analysis shows that an intelligent drafting
strategy can achieve better speedup as batch size increases. MagicDec first
identifies the bottleneck shifts that occur with increasing batch size and
sequence length, and uses these insights to deploy speculative decoding more
effectively for high-throughput inference. It then leverages draft models with
a sparse KV cache to address the KV bottleneck, which scales with both
sequence length and batch size.
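For readers unfamiliar with speculative decoding, the sketch below illustrates the standard draft-then-verify accept/reject rule on toy distributions. It is a minimal conceptual sketch, not the MagicDec implementation: `draft_probs` stands in for a small draft model (which, per the paper, would use a sparse, fixed-budget KV cache), `target_probs` for the full target model scoring all draft tokens in one pass, and the toy vocabulary, function names, and omitted bonus-token step are assumptions made for brevity.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """Standard speculative-sampling accept/reject rule for one speculation round.

    draft_probs, target_probs: (K, V) per-position token distributions.
    draft_tokens: K tokens proposed by the draft model.
    Returns the tokens kept after verification (bonus token on full acceptance omitted).
    """
    accepted = []
    for t, tok in enumerate(draft_tokens):
        p, q = target_probs[t, tok], draft_probs[t, tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)  # target agrees often enough: keep the draft token
        else:
            # Resample from the residual distribution max(p - q, 0), renormalized,
            # and discard all remaining draft tokens.
            residual = np.maximum(target_probs[t] - draft_probs[t], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted

# Toy usage: random distributions stand in for draft/target model outputs.
rng = np.random.default_rng(0)
V, K = 8, 4  # toy vocabulary size and speculation length
draft_probs = softmax(rng.normal(size=(K, V)))
target_probs = softmax(rng.normal(size=(K, V)))
draft_tokens = [int(rng.choice(V, p=draft_probs[t])) for t in range(K)]
print(speculative_step(draft_probs, target_probs, draft_tokens, rng))
```

The key property this rule preserves is that accepted tokens are distributed exactly as if sampled from the target model, so a cheap draft (here, one whose cost the paper keeps low via a sparse KV cache) trades extra small-model work for fewer expensive target-model decoding steps.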