XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference

Abstract

In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of self-attention operations, so caching is desirable. However, caching transformer states can easily require almost as much space as the model parameters. When the right context is not known in advance, caching for ICL can be challenging. This work addresses these limitations by introducing models that, inspired by the encoder-decoder architecture, use cross-attention to condition generation on reference text without the prompt. More precisely, we leverage pre-trained decoder-only models and train only a small number of added layers. We use Question-Answering (QA) as a testbed to evaluate the ability of our models to perform conditional generation and observe that they outperform ICL, are comparable to fine-tuned prompted LLMs, and reduce the space footprint by two orders of magnitude relative to standard KV caching.
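To make the cache-size claim concrete, the following back-of-the-envelope sketch estimates how many cached tokens it takes for a standard KV cache to match the size of the model weights. The configuration (32 layers, hidden size 4096, 7 billion parameters, 16-bit values, full multi-head attention without grouped-query sharing) is a hypothetical example chosen for illustration and is not taken from the paper; the function names are likewise illustrative.

```python
def kv_cache_bytes(num_tokens, num_layers, hidden_size, bytes_per_value=2):
    """Approximate bytes needed to cache keys and values for num_tokens.

    Assumes one key and one value vector of width hidden_size per token
    per layer (no grouped-query or multi-query sharing).
    """
    # Factor 2 accounts for storing both K and V in every layer.
    return 2 * num_layers * num_tokens * hidden_size * bytes_per_value


def param_bytes(num_params, bytes_per_value=2):
    """Approximate bytes needed to store the model weights."""
    return num_params * bytes_per_value


# Hypothetical 7B-class configuration (assumption, not from the paper).
NUM_LAYERS = 32
HIDDEN_SIZE = 4096
NUM_PARAMS = 7_000_000_000

per_token = kv_cache_bytes(1, NUM_LAYERS, HIDDEN_SIZE)
weights = param_bytes(NUM_PARAMS)
tokens_to_match = weights // per_token

print(f"KV cache per token: {per_token} bytes")
print(f"Model weights: {weights} bytes")
print(f"Cached tokens needed to match the weights: {tokens_to_match}")
```

Under these assumptions each cached token costs 512 KiB, so a few tens of thousands of cached tokens, spread across however many reference documents, already occupy as much space as the weights themselves. This is the storage cost the abstract describes and that a much smaller cached representation is meant to reduce.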
