

KV Prediction for Improved Time to First Token

October 10, 2024
Authors: Maxwell Horton, Qingqing Cao, Chenfan Sun, Yanzi Jin, Sachin Mehta, Mohammad Rastegari, Moin Nabi
cs.AI

Abstract

Inference with transformer-based language models begins with a prompt processing step. In this step, the model generates the first output token and stores the KV cache needed for future generation steps. This prompt processing step can be computationally expensive, taking tens of seconds or more for billion-parameter models on edge devices when prompt lengths or batch sizes rise. This degrades user experience by introducing significant latency into the model's outputs. To reduce the time spent producing the first output (known as the "time to first token", or TTFT) of a pretrained model, we introduce a novel method called KV Prediction. In our method, a small auxiliary model is used to process the prompt and produce an approximation of the KV cache used by a base model. This approximated KV cache is then used with the base model for autoregressive generation without the need to query the auxiliary model again. We demonstrate that our method produces a Pareto-optimal efficiency-accuracy trade-off when compared to baselines. On TriviaQA, we demonstrate relative accuracy improvements in the range of 15%-50% across a range of TTFT FLOPs budgets. We also demonstrate accuracy improvements of up to 30% on HumanEval Python code completion at fixed TTFT FLOPs budgets. Additionally, we benchmark models on an Apple M2 Pro CPU and demonstrate that our improvement in FLOPs translates to a TTFT speedup on hardware. We release our code at https://github.com/apple/corenet/tree/main/projects/kv-prediction.
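
The flow described in the abstract is: a small auxiliary model processes the prompt once, lightweight predictors translate its KV cache into an approximation of the base model's KV cache, and that approximated cache then seeds ordinary autoregressive decoding in the base model. The sketch below illustrates this flow; it is not the released corenet implementation. The model interface (HF-style `.logits` / `.past_key_values` outputs), the per-layer linear `KVPredictor`, and the assumption that the auxiliary and base models share a layer count are all illustrative simplifications, since the abstract does not specify the form of the approximation.

```python
import torch
import torch.nn as nn


class KVPredictor(nn.Module):
    """Hypothetical learned linear map from one auxiliary-model cache layer
    to the corresponding base-model cache layer."""

    def __init__(self, aux_dim: int, base_dim: int):
        super().__init__()
        self.k_proj = nn.Linear(aux_dim, base_dim, bias=False)
        self.v_proj = nn.Linear(aux_dim, base_dim, bias=False)

    def forward(self, k_aux: torch.Tensor, v_aux: torch.Tensor):
        # (batch, prompt_len, aux_dim) -> (batch, prompt_len, base_dim)
        return self.k_proj(k_aux), self.v_proj(v_aux)


@torch.no_grad()
def generate_with_kv_prediction(aux_model, base_model, predictors,
                                prompt_ids, max_new_tokens=32):
    """Assumes both models return HF-style outputs with `.logits` and
    `.past_key_values`, where the cache is a list of (key, value) tensors of
    shape (batch, seq_len, dim), and that there is one predictor per layer
    (in general a mapping between different depths would be needed)."""

    # 1) Prompt processing runs only through the cheap auxiliary model;
    #    this is where the TTFT savings come from.
    aux_cache = aux_model(prompt_ids[:, :-1], use_cache=True).past_key_values

    # 2) Predict an approximate base-model KV cache from the auxiliary cache.
    approx_cache = [pred(k, v) for pred, (k, v) in zip(predictors, aux_cache)]

    # 3) The base model consumes only the last prompt token against the
    #    approximated cache to emit the first output token, then decodes
    #    autoregressively; the auxiliary model is never queried again.
    tokens, cache = prompt_ids, approx_cache
    for _ in range(max_new_tokens):
        out = base_model(tokens[:, -1:], past_key_values=cache, use_cache=True)
        cache = out.past_key_values
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=-1)
    return tokens
```

The one-shot handoff in step 2 is what distinguishes this from speculative or cascaded decoding: after the cache is predicted, generation proceeds entirely in the base model at its usual per-token cost.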
