

SparQ Attention: Bandwidth-Efficient LLM Inference

December 8, 2023
Authors: Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr
cs.AI

Abstract

Generative large language models (LLMs) have opened up numerous novel possibilities, but due to their significant computational requirements, their ubiquitous use remains challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both of which significantly increase the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show how SparQ Attention can decrease the attention memory bandwidth requirements by up to eight times without any loss in accuracy, by evaluating Llama 2 and Pythia models on a wide range of downstream tasks.
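
To make the selective-fetching idea concrete, below is a minimal single-head, single-query decoding sketch in PyTorch. It is an illustration under our own assumptions: the function name `sparq_attention_step`, the rank `r`, the top-`k` budget, and the scaling and mean-value compensation details are our choices for exposition, not the authors' reference implementation.

```python
# Minimal sketch of selectively fetching the cached history during one
# decode step of attention. Not the authors' code; shapes and parameter
# values are illustrative.
import torch
import torch.nn.functional as F

def sparq_attention_step(q, K, V, r=16, k=64):
    """q: (d,) current query; K, V: (S, d) cached keys/values.
    r: number of query components used for approximate scoring.
    k: number of history positions fetched in full."""
    S, d = K.shape
    k = min(k, S)

    # 1) Rank the components of q by magnitude; only these r columns of K
    #    need to be read to form approximate attention scores.
    idx = torch.topk(q.abs(), r).indices
    q_r, K_r = q[idx], K[:, idx]                        # (r,), (S, r)
    scale = (d * q_r.abs().sum() / q.abs().sum()).sqrt()
    s_hat = F.softmax(q_r @ K_r.T / scale, dim=-1)      # (S,)

    # 2) Keep the k positions with the largest approximate scores and
    #    fetch their full key/value vectors for exact attention.
    top = torch.topk(s_hat, k).indices
    attn = F.softmax(q @ K[top].T / d ** 0.5, dim=-1)   # (k,)
    y_top = attn @ V[top]                                # (d,)

    # 3) Compensate for the discarded positions with the running mean of
    #    all cached values, weighted by the approximate mass that was kept.
    alpha = s_hat[top].sum()
    return alpha * y_top + (1 - alpha) * V.mean(dim=0)

# Example: a 1024-token cached history with head dimension d = 128.
torch.manual_seed(0)
q, K, V = torch.randn(128), torch.randn(1024, 128), torch.randn(1024, 128)
print(sparq_attention_step(q, K, V).shape)  # torch.Size([128])
```

In this sketch, the bandwidth saving comes from reading only `r` columns of the key cache for every position plus the full keys and values for just `k` positions, rather than streaming the entire cache each step.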