MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

August 20, 2024
Authors: Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Beidi Chen
cs.AI

Abstract

Large Language Models (LLMs) have become more prevalent in long-context applications such as interactive chatbots, document analysis, and agent workflows, but serving long-context requests with low latency and high throughput is challenging. Speculative decoding (SD) is a widely used technique for reducing latency without sacrificing performance, but conventional wisdom holds that its efficacy is limited to small batch sizes. In MagicDec, we show the surprising result that SD can achieve speedup even in a high-throughput inference regime for moderate-to-long sequences. More interestingly, our rigorous analysis shows that an intelligent drafting strategy can achieve better speedup as batch size increases. MagicDec first identifies how the bottleneck shifts with increasing batch size and sequence length, and uses these insights to deploy speculative decoding more effectively for high-throughput inference. It then leverages draft models with a sparse KV cache to address the KV bottleneck, which scales with both sequence length and batch size.
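The combination described in the abstract, speculative drafting plus verification, with the draft's cost capped by a sparse KV cache, can be made concrete with a short sketch. The following is a minimal, illustrative Python example, not MagicDec's implementation: `target_probs`, `draft_probs`, `WINDOW`, and `GAMMA` are hypothetical stand-ins, and the draft's sparse KV cache is approximated by conditioning on only the last `WINDOW` tokens (a StreamingLLM-style window). Verification uses the standard speculative sampling accept/reject rule.

```python
# Toy sketch of speculative decoding with a "sparse-KV" draft.
# Illustrative only: target_probs / draft_probs are stand-in models,
# and the sparse KV cache is mimicked by a fixed attention window.
import numpy as np

VOCAB = 16      # toy vocabulary size
WINDOW = 8      # draft sees only the last WINDOW tokens (sparse KV proxy)
GAMMA = 4       # number of speculative tokens drafted per step

rng = np.random.default_rng(0)

def _softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def target_probs(ctx):
    # Stand-in for the full target model: conditions on the whole context.
    h = hash(tuple(ctx)) % (2**32)
    return _softmax(np.random.default_rng(h).normal(size=VOCAB))

def draft_probs(ctx):
    # Stand-in for the draft: conditions on a sparse window only, so its
    # per-token cost stays flat as the sequence grows.
    h = hash(tuple(ctx[-WINDOW:])) % (2**32)
    return _softmax(np.random.default_rng(h).normal(size=VOCAB))

def speculative_step(ctx):
    # 1) Draft proposes GAMMA tokens autoregressively.
    proposal, q = [], []
    c = list(ctx)
    for _ in range(GAMMA):
        p = draft_probs(c)
        t = int(rng.choice(VOCAB, p=p))
        proposal.append(t); q.append(p); c.append(t)
    # 2) Target verifies (one batched forward pass in practice).
    accepted = []
    for i, t in enumerate(proposal):
        p = target_probs(list(ctx) + proposal[:i])
        if rng.random() < min(1.0, p[t] / q[i][t]):
            accepted.append(t)
        else:
            # Rejected: resample from the residual distribution max(p-q, 0).
            residual = np.maximum(p - q[i], 0)
            residual = residual / residual.sum() if residual.sum() > 0 else p
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            return list(ctx) + accepted
    # All drafts accepted: sample one bonus token from the target.
    p = target_probs(list(ctx) + proposal)
    accepted.append(int(rng.choice(VOCAB, p=p)))
    return list(ctx) + accepted

ctx = [1, 2, 3]
for _ in range(5):
    ctx = speculative_step(ctx)
print(ctx)
```

The design intuition matches the abstract's argument: at large batch sizes and long sequences, decoding is dominated by KV-cache reads, and a draft whose KV footprint is bounded (here, the fixed `WINDOW`) keeps drafting cheap even as the target's per-token cost grows, which is why speedup can improve rather than degrade with batch size.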
