MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
August 20, 2024
Authors: Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Beidi Chen
cs.AI
Abstract
Large Language Models (LLMs) have become more prevalent in long-context
applications such as interactive chatbots, document analysis, and agent
workflows, but it is challenging to serve long-context requests with low
latency and high throughput. Speculative decoding (SD) is a widely used
technique for reducing latency without sacrificing performance, but
conventional wisdom holds that its efficacy is limited to small batch sizes.
In MagicDec, we show that, surprisingly, SD can achieve speedup even in a
high-throughput inference regime for moderate to long sequences. More
interestingly, our rigorous analysis shows that an intelligent drafting
strategy can achieve better speedup as batch size increases. MagicDec first
identifies the bottleneck shifts that occur with increasing batch size and
sequence length, and uses these insights to deploy speculative decoding more
effectively for high-throughput inference. It then leverages draft models with
a sparse KV cache to address the KV bottleneck, which scales with both
sequence length and batch size.
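For readers unfamiliar with speculative decoding, the sketch below illustrates the standard draft-then-verify accept/reject rule on toy distributions. It is a minimal conceptual sketch, not the MagicDec implementation: `draft_probs` stands in for a small draft model (which, per the paper, would use a sparse, fixed-budget KV cache), `target_probs` for the full target model scoring all draft tokens in one pass, and the toy vocabulary, function names, and omitted bonus-token step are assumptions made for brevity.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """Standard speculative-sampling accept/reject rule for one speculation round.

    draft_probs, target_probs: (K, V) per-position token distributions.
    draft_tokens: K tokens proposed by the draft model.
    Returns the tokens kept after verification (bonus token on full acceptance omitted).
    """
    accepted = []
    for t, tok in enumerate(draft_tokens):
        p, q = target_probs[t, tok], draft_probs[t, tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)  # target agrees often enough: keep the draft token
        else:
            # Resample from the residual distribution max(p - q, 0), renormalized,
            # and discard all remaining draft tokens.
            residual = np.maximum(target_probs[t] - draft_probs[t], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break
    return accepted

# Toy usage: random distributions stand in for draft/target model outputs.
rng = np.random.default_rng(0)
V, K = 8, 4  # toy vocabulary size and speculation length
draft_probs = softmax(rng.normal(size=(K, V)))
target_probs = softmax(rng.normal(size=(K, V)))
draft_tokens = [int(rng.choice(V, p=draft_probs[t])) for t in range(K)]
print(speculative_step(draft_probs, target_probs, draft_tokens, rng))
```

The key property this rule preserves is that accepted tokens are distributed exactly as if sampled from the target model, so a cheap draft (here, one whose cost the paper keeps low via a sparse KV cache) trades extra small-model work for fewer expensive target-model decoding steps.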