
XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference

April 23, 2024
作者: João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian
cs.AI

Abstract

In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of self-attention operations, and caching is desirable. However, caching transformer states can easily require almost as much space as the model parameters. When the right context isn't known in advance, caching ICL can be challenging. This work addresses these limitations by introducing models that, inspired by the encoder-decoder architecture, use cross-attention to condition generation on reference text without the prompt. More precisely, we leverage pre-trained decoder-only models and only train a small number of added layers. We use Question-Answering (QA) as a testbed to evaluate the ability of our models to perform conditional generation and observe that they outperform ICL, are comparable to fine-tuned prompted LLMs, and drastically reduce the space footprint relative to standard KV caching by two orders of magnitude.
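To make the mechanism concrete, the sketch below shows one way a trainable cross-attention block could sit on top of a frozen decoder-only model, attending to context representations that were encoded once and cached. All class names, dimensions, and the `cache_context` helper are illustrative assumptions for exposition, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class CrossAttentionAdapter(nn.Module):
    """Trainable cross-attention block added on top of a frozen decoder-only LM.
    Queries come from the decoder's hidden states for the question/answer tokens;
    keys and values come from cached context representations."""

    def __init__(self, d_model: int = 4096, n_heads: int = 32):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, cached_ctx: torch.Tensor) -> torch.Tensor:
        # hidden:     (batch, tgt_len, d_model) -- decoder states being generated
        # cached_ctx: (batch, ctx_len, d_model) -- context states computed once, stored offline
        attended, _ = self.attn(query=hidden, key=cached_ctx, value=cached_ctx, need_weights=False)
        return hidden + self.norm(attended)  # residual keeps the frozen model's signal intact


def cache_context(context_encoder: nn.Module, context_ids: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: run the reference text through an encoder once and keep only
    its final hidden states. Only this single tensor is cached, instead of per-layer
    key/value tensors for every decoder layer as in standard KV caching."""
    with torch.no_grad():
        return context_encoder(context_ids)  # assumed to return (batch, ctx_len, d_model)
```

Because only one compact tensor of context states is stored, rather than keys and values for every decoder layer, the cached representation is far smaller than a standard KV cache, which is the source of the space savings described in the abstract.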
