Draft-based Approximate Inference for LLMs

June 10, 2025
Authors: Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee
cs.AI

Abstract

Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a novel framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Specifically, we introduce two instantiations of our proposed framework: (i) SpecKV, which leverages a draft output to accurately assess the importance of each KV pair for more effective KV cache dropping, and (ii) SpecPC, which uses the draft model's attention activations to identify and discard unimportant prompt tokens. To the best of our knowledge, this is the first work to use draft models for approximate LLM inference acceleration, extending their utility beyond traditional lossless speculative decoding. We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput. Our code is available at https://github.com/furiosa-ai/draft-based-approx-llm.
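
The sketch below illustrates the SpecKV idea as described in the abstract: a small draft model generates a short lookahead, each prompt KV position is scored by the attention it receives from those lookahead tokens, and only the top-scoring pairs would be kept in the target model's cache. The model pair, the layer/head pooling, and names such as `speckv_importance` are illustrative assumptions on our part, not the authors' released implementation (see the linked repository for that).

```python
# Illustrative SpecKV-style sketch (our assumptions, not the paper's code):
# score prompt KV pairs by the draft attention they receive from a short
# draft-generated lookahead, then keep only the top-k pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

draft_id = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical choice of draft model
tok = AutoTokenizer.from_pretrained(draft_id)
draft = AutoModelForCausalLM.from_pretrained(draft_id, attn_implementation="eager")

@torch.no_grad()
def speckv_importance(prompt_ids, lookahead=16):
    """Score each prompt position by the attention mass it attracts
    from draft lookahead tokens, pooled over layers and heads."""
    # 1) The draft model produces a cheap approximation of the future output.
    draft_out = draft.generate(prompt_ids, max_new_tokens=lookahead,
                               do_sample=False, pad_token_id=tok.eos_token_id)
    # 2) One forward pass over prompt + lookahead to collect attention maps.
    attns = draft(draft_out, output_attentions=True).attentions  # per layer: (B, H, T, T)
    n_prompt = prompt_ids.shape[1]
    pooled = torch.stack(attns).mean(dim=(0, 2))  # (B, T, T), pooled over layers/heads
    # 3) Importance of prompt token j = attention flowing from lookahead rows to column j.
    return pooled[:, n_prompt:, :n_prompt].sum(dim=1)  # (B, n_prompt)

prompt_ids = tok("Very long context ...", return_tensors="pt").input_ids
scores = speckv_importance(prompt_ids)
keep = scores.topk(min(256, scores.shape[1]), dim=-1).indices
# `keep` indexes the prompt KV pairs that would survive in the target cache;
# all other entries would be dropped before target-model decoding.
```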
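In the same hedged spirit, this second sketch illustrates SpecPC: the draft model alone reads the prompt, its attention activations rank prompt tokens, and only the top fraction is forwarded to the target model. The pooling scheme, `keep_ratio`, and `specpc_compress` are again our own illustrative choices rather than the paper's implementation.

```python
# Illustrative SpecPC-style sketch (our assumptions, not the paper's code):
# prune prompt tokens using the draft model's attention activations,
# then run the large target model on the shortened prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Qwen2.5 models share a tokenizer, so token ids transfer across draft/target.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
draft = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", attn_implementation="eager")
target = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

@torch.no_grad()
def specpc_compress(prompt_ids, keep_ratio=0.25):
    """Keep the fraction of prompt tokens that attract the most
    draft-model attention, preserving their original order."""
    attns = draft(prompt_ids, output_attentions=True).attentions
    # Pool over layers and heads, then sum over query positions:
    # one attention-mass score per prompt token (key column).
    scores = torch.stack(attns).mean(dim=(0, 2)).sum(dim=1)  # (B, T)
    k = max(1, int(keep_ratio * prompt_ids.shape[1]))
    keep = scores.topk(k, dim=-1).indices.sort(dim=-1).values
    return prompt_ids.gather(1, keep)

prompt_ids = tok("Very long context ...", return_tensors="pt").input_ids
out = target.generate(specpc_compress(prompt_ids), max_new_tokens=128,
                      pad_token_id=tok.eos_token_id)
```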