Prompt Cache: Modular Attention Reuse for Low-Latency Inference

Abstract

We present Prompt Cache, an approach for accelerating inference for large language models (LLMs) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context. Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when these segments appear in user prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. The schema ensures positional accuracy during attention state reuse and provides users with an interface to access cached states in their prompts. Using a prototype implementation, we evaluate Prompt Cache across several LLMs. We show that Prompt Cache significantly reduces time-to-first-token latency, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.
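The idea of precomputing states for prompt modules and reusing them at fixed positions can be illustrated with a toy sketch. This is not the prototype implementation evaluated here. The names `PromptSchema`, `toy_states`, `precompute`, and `serve` are hypothetical, the whitespace tokenizer is a simplification, and the "attention states" are stand-in (token, position) pairs rather than real key/value tensors. The sketch also assumes that each module keeps the position range assigned by the schema even when other modules are left out of a prompt.

```python
from dataclasses import dataclass, field


def toy_states(tokens, start_pos):
    # Stand-in for real key/value attention states:
    # one (token, position) pair per token.
    return [(tok, start_pos + i) for i, tok in enumerate(tokens)]


@dataclass
class PromptSchema:
    """Hypothetical schema: named prompt modules with fixed position ranges."""
    modules: dict                                # module name -> module text
    cache: dict = field(default_factory=dict)    # module name -> cached states
    end: int = 0                                 # first position after all modules
    computed_tokens: int = 0                     # tokens actually encoded so far

    def _encode(self, text, start_pos):
        tokens = text.split()  # whitespace tokenizer, for illustration only
        self.computed_tokens += len(tokens)
        return toy_states(tokens, start_pos)

    def precompute(self):
        # Encode every module once, at the position range the schema assigns it.
        pos = 0
        for name, text in self.modules.items():
            self.cache[name] = self._encode(text, pos)
            pos += len(self.cache[name])
        self.end = pos

    def serve(self, module_names, user_text):
        # Reuse cached states for the selected modules; encode only the new text.
        states = []
        for name in module_names:
            states.extend(self.cache[name])
        states.extend(self._encode(user_text, self.end))
        return states


schema = PromptSchema({
    "system": "You are a helpful assistant",
    "doc": "Prompt Cache reuses attention states",
})
schema.precompute()
states = schema.serve(["doc"], "What is reused ?")
```

In this sketch, serving the prompt encodes only the four tokens of the user text, while the five tokens of the `doc` module come from the cache and keep the positions they were assigned at precompute time.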
