SparQ Attention: Bandwidth-Efficient LLM Inference

Abstract

Generative large language models (LLMs) have opened up numerous novel possibilities, but their significant computational requirements make ubiquitous use challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both of which significantly increase the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. The technique can be applied directly to off-the-shelf LLMs during inference, without any modification to the pre-training setup or additional fine-tuning. By evaluating Llama 2 and Pythia models on a wide range of downstream tasks, we show that SparQ Attention can decrease the attention memory bandwidth requirements by up to eight times without any loss in accuracy.
