

SparQ Attention: Bandwidth-Efficient LLM Inference

December 8, 2023
Authors: Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr
cs.AI

Abstract

Generative large language models (LLMs) have opened up numerous novel possibilities, but due to their significant computational requirements their ubiquitous use remains challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both significantly increasing the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show how SparQ Attention can decrease the attention memory bandwidth requirements up to eight times without any loss in accuracy by evaluating Llama 2 and Pythia models on a wide range of downstream tasks.
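The abstract only sketches the core idea: approximate which cached positions matter and fetch the full keys and values for those alone. Below is a minimal, hypothetical NumPy illustration of such selective fetching for a single decode step. The function name, the use of the r largest-magnitude query components as a cheap scoring proxy, and the parameters r and k are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def selective_fetch_attention(q, K, V, r=16, k=64):
    """Illustrative sparse-fetch attention for one decode step.

    q: (d,)   query for the current token
    K: (n, d) cached keys
    V: (n, d) cached values
    r: number of query components used to approximate scores
    k: number of history positions fully fetched
    """
    n, d = K.shape
    # 1) Pick the r largest-magnitude query components as a cheap proxy.
    idx = np.argsort(-np.abs(q))[:r]
    # 2) Approximate attention scores while reading only those key columns.
    approx_scores = K[:, idx] @ q[idx] / np.sqrt(d)
    # 3) Fetch full keys/values only for the top-k most promising positions.
    top = np.argsort(-approx_scores)[: min(k, n)]
    scores = K[top] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    # 4) Attend over the fetched subset only.
    return w @ V[top]
```

In this sketch, the memory traffic per step is roughly r/d of the key cache plus k/n of the key and value caches, rather than the full cache, which is the kind of bandwidth reduction the abstract refers to.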