SparQ Attention: Bandwidth-Efficient LLM Inference

Abstract

Generative large language models (LLMs) have opened up numerous novel possibilities, but their significant computational requirements still make ubiquitous use challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both of which significantly increase the memory communication load of the models. We introduce SparQ Attention, a technique that increases the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. The technique can be applied directly to off-the-shelf LLMs during inference, without any modification to the pre-training setup or additional fine-tuning. By evaluating Llama 2 and Pythia models on a wide range of downstream tasks, we show that SparQ Attention can reduce the attention memory bandwidth requirements by up to eight times without any loss in accuracy.
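To make the memory communication load concrete, the following back-of-the-envelope sketch estimates how many bytes of cached keys and values a dense attention implementation reads per generation step, and what an eightfold reduction would mean. The function name `kv_cache_bytes_per_step` and the model shape (32 layers, 32 heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures taken from the abstract, and dividing by eight is only the stated upper bound rather than a model of how SparQ Attention selects what to fetch.

```python
def kv_cache_bytes_per_step(batch_size, seq_len, n_layers, n_kv_heads,
                            head_dim, bytes_per_elem=2):
    """Bytes of cached keys and values read to generate one new token
    per sequence, assuming dense attention reads the whole cache."""
    # Factor 2 covers keys plus values; bytes_per_elem=2 assumes 16-bit storage.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return batch_size * seq_len * per_token


# Hypothetical configuration: shapes typical of a 7B-parameter decoder.
dense = kv_cache_bytes_per_step(batch_size=8, seq_len=4096,
                                n_layers=32, n_kv_heads=32, head_dim=128)

# The "up to eight times" figure from the abstract, applied as a best case.
reduced = dense / 8

print(f"dense attention reads   {dense / 2**30:.1f} GiB per step")
print(f"with an 8x reduction:   {reduced / 2**30:.1f} GiB per step")
```

The estimate grows linearly in both `batch_size` and `seq_len`, which is why large batches and long contexts together dominate memory traffic during generation.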
