

Prompt Cache: Modular Attention Reuse for Low-Latency Inference

November 7, 2023
Authors: In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong
cs.AI

Abstract

We present Prompt Cache, an approach for accelerating inference for large language models (LLMs) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context. Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when the segments appear in user prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. The schema ensures positional accuracy during attention state reuse and provides users with an interface to access cached states in their prompts. Using a prototype implementation, we evaluate Prompt Cache across several LLMs. We show that Prompt Cache significantly reduces time-to-first-token latency, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.
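
The core mechanism, reusing precomputed attention (key/value) states for text segments that recur across prompts, can be illustrated with a short sketch using Hugging Face transformers. This is not the paper's implementation: it shows only the simpler prefix-reuse special case, whereas Prompt Cache's schema additionally keeps positions accurate for modules appearing at varying offsets. The model name, prompt strings, and variable names below are illustrative assumptions.

```python
# Minimal sketch of attention-state (KV cache) reuse for a shared prompt prefix.
# Assumptions: any causal LM from Hugging Face transformers; "gpt2" is a stand-in.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumed placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# 1) Offline: precompute and store the attention states of a frequently reused segment.
shared_segment = "System: You answer questions about the provided document.\n"
shared_ids = tok(shared_segment, return_tensors="pt").input_ids
with torch.no_grad():
    shared_cache = model(shared_ids, use_cache=True).past_key_values

# 2) At request time: feed only the new suffix and attend to the cached prefix states,
#    so time-to-first-token no longer pays for re-encoding the shared segment.
user_suffix = "User: What is the document about?\nAssistant:"
suffix_ids = tok(user_suffix, return_tensors="pt").input_ids
with torch.no_grad():
    out = model(
        suffix_ids,
        past_key_values=copy.deepcopy(shared_cache),  # copy so the stored cache stays reusable
        use_cache=True,
    )

# The logits for the last suffix position give the first generated token.
next_token = out.logits[:, -1, :].argmax(dim=-1)
print(tok.decode(next_token))
```

In this prefix-only setting the reuse is exact because the cached segment keeps its original positions; handling modules placed at different offsets within a prompt is the harder case that the paper's schema is designed to address.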