XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
April 23, 2024
Authors: João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian
cs.AI
Abstract
In-context learning (ICL) approaches typically leverage prompting to
condition decoder-only language model generation on reference information.
Just-in-time processing of a context is inefficient due to the quadratic cost
of self-attention operations, and caching is desirable. However, caching
transformer states can easily require almost as much space as the model
parameters. When the right context isn't known in advance, caching ICL can be
challenging. This work addresses these limitations by introducing models that,
inspired by the encoder-decoder architecture, use cross-attention to condition
generation on reference text without the prompt. More precisely, we leverage
pre-trained decoder-only models and only train a small number of added layers.
We use Question-Answering (QA) as a testbed to evaluate the ability of our
models to perform conditional generation and observe that they outperform ICL,
are comparable to fine-tuned prompted LLMs, and drastically reduce the space
footprint relative to standard KV caching by two orders of magnitude.
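
To make the space argument concrete, the following is a back-of-the-envelope sketch of the standard key/value-cache size estimate for a decoder-only transformer. The Llama-2-7B-like dimensions are assumptions chosen for illustration, not figures reported in the paper.

```python
# Rough KV-cache size estimate for a decoder-only transformer.
# The dimensions below are Llama-2-7B-like and purely illustrative.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The factor of 2 accounts for storing both keys and values at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

if __name__ == "__main__":
    # 32 layers, 32 KV heads of dimension 128, fp16 states, 4096-token context.
    size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
    print(f"{size / 2**30:.1f} GiB per cached 4096-token context")  # ~2.0 GiB
```

Caching many reference documents at this rate quickly rivals the memory taken by the model weights themselves, which is the overhead targeted by the two-orders-of-magnitude reduction claimed in the abstract.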
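The core architectural idea in the abstract (a frozen pre-trained decoder-only model extended with a small number of trainable cross-attention layers that attend to cached context representations) can be sketched roughly as follows. This is a minimal PyTorch illustration under assumed names and hyperparameters, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Trainable cross-attention added on top of a frozen decoder block.

    Queries come from the decoder's hidden states; keys and values come from
    precomputed (cached) representations of the reference context, so the
    context never passes through the decoder's self-attention at inference.
    Module name and hyperparameters are illustrative assumptions.
    """
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, context_states: torch.Tensor) -> torch.Tensor:
        # Query: decoder states; Key/Value: cached context representations.
        attended, _ = self.cross_attn(self.norm(hidden), context_states, context_states)
        # Residual connection keeps the frozen decoder path unchanged.
        return hidden + attended

if __name__ == "__main__":
    hidden = torch.randn(2, 16, 512)           # decoder states for the question
    context_states = torch.randn(2, 128, 512)  # cached context representations
    out = CrossAttentionAdapter()(hidden, context_states)
    print(out.shape)  # torch.Size([2, 16, 512])
```

Only layers like this adapter would be trained; the underlying decoder's parameters stay frozen, which matches the abstract's claim of training only a small number of added layers.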