Prompt Cache: Modular Attention Reuse for Low-Latency Inference
November 7, 2023
Authors: In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong
cs.AI
Abstract
We present Prompt Cache, an approach for accelerating inference for large
language models (LLMs) by reusing attention states across different LLM prompts.
Many input prompts have overlapping text segments, such as system messages,
prompt templates, and documents provided for context. Our key insight is that
by precomputing and storing the attention states of these frequently occurring
text segments on the inference server, we can efficiently reuse them when these
segments appear in user prompts. Prompt Cache employs a schema to explicitly
define such reusable text segments, called prompt modules. The schema ensures
positional accuracy during attention state reuse and provides users with an
interface to access cached states in their prompt. Using a prototype
implementation, we evaluate Prompt Cache across several LLMs. We show that
Prompt Cache significantly reduces time-to-first-token latency, especially
for longer prompts such as document-based question answering and
recommendations. The improvements range from 8x for GPU-based inference to 60x
for CPU-based inference, all while maintaining output accuracy and without the
need for model parameter modifications.