HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
February 18, 2025
Authors: Cheng Luo, Zefan Cai, Hanshi Sun, Jinqi Xiao, Bo Yuan, Wen Xiao, Junjie Hu, Jiawei Zhao, Beidi Chen, Anima Anandkumar
cs.AI
Abstract
Transformer-based large language models (LLMs) demonstrate impressive
performance in long context generation. Extending the context length has
disproportionately shifted the memory footprint of LLMs during inference to the
key-value cache (KV cache). In this paper, we propose HEADINFER, which offloads
the KV cache to CPU RAM while avoiding the need to fully store the KV cache for
any transformer layer on the GPU. HEADINFER employs a fine-grained, head-wise
offloading strategy, keeping only selected attention heads' KV cache on the
GPU while computing attention output dynamically. Through roofline analysis, we
demonstrate that HEADINFER maintains computational efficiency while
significantly reducing memory footprint. We evaluate HEADINFER on the
Llama-3-8B model with a 1-million-token sequence, reducing the GPU memory
footprint of the KV cache from 128 GB to 1 GB and the total GPU memory usage
from 207 GB to 17 GB, achieving a 92% reduction compared to BF16 baseline
inference. Notably, HEADINFER enables 4-million-token inference with an 8B
model on a single consumer GPU with 24GB memory (e.g., NVIDIA RTX 4090) without
approximation methods.
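
To make the headline numbers concrete, the back-of-the-envelope calculation below reproduces the 128 GB figure from Llama-3-8B's public configuration (32 layers, 8 grouped-query KV heads, head dimension 128, BF16). The roughly 1 GB on-GPU figure reported in the abstract is of the same order as keeping a single head's cache resident; the exact on-GPU budget depends on buffering details the abstract does not state.

    # Llama-3-8B public config: 32 layers, 8 KV heads (GQA), head_dim 128, BF16 (2 bytes).
    LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2
    SEQ_LEN = 1 << 20                                            # ~1 million tokens

    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES         # K and V: 128 KiB per token
    full_cache = per_token * SEQ_LEN                             # 2**37 bytes = 128 GiB
    per_head_per_layer = full_cache / (LAYERS * KV_HEADS)        # 0.5 GiB

    print(f"full KV cache (baseline, on GPU): {full_cache / 2**30:.0f} GiB")
    print(f"one head of one layer           : {per_head_per_layer / 2**30:.1f} GiB")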
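The abstract describes the mechanism only at a high level, so the following is a minimal PyTorch sketch of what head-wise offloading could look like during decoding, not the authors' implementation. The names HeadwiseKVCache, append, head_to_gpu, and attend_headwise are hypothetical; the sketch keeps the full cache in pinned CPU memory, stages one head's K/V onto the GPU at a time, and for brevity ignores grouped-query attention and the transfer/compute overlap a real system would need to retain computational efficiency.

    import torch
    import torch.nn.functional as F

    class HeadwiseKVCache:
        """Hypothetical per-head KV cache kept in pinned CPU memory."""

        def __init__(self, n_kv_heads, max_len, head_dim, dtype=torch.bfloat16):
            self.k = torch.empty(n_kv_heads, max_len, head_dim, dtype=dtype, pin_memory=True)
            self.v = torch.empty(n_kv_heads, max_len, head_dim, dtype=dtype, pin_memory=True)
            self.len = 0

        def append(self, k_new, v_new):
            # k_new, v_new: [n_kv_heads, t, head_dim] on the GPU; write back to host memory.
            t = k_new.shape[1]
            self.k[:, self.len:self.len + t].copy_(k_new, non_blocking=True)
            self.v[:, self.len:self.len + t].copy_(v_new, non_blocking=True)
            self.len += t

        def head_to_gpu(self, h, device="cuda"):
            # Stage a single head's K/V onto the GPU; this is the only KV resident there.
            k = self.k[h, :self.len].to(device, non_blocking=True)
            v = self.v[h, :self.len].to(device, non_blocking=True)
            return k, v

    def attend_headwise(q, cache):
        # q: [n_kv_heads, 1, head_dim] query of the newest token, already on the GPU.
        outs = []
        for h in range(q.shape[0]):
            k, v = cache.head_to_gpu(h)                    # [len, head_dim]
            o = F.scaled_dot_product_attention(
                q[h:h + 1].unsqueeze(0),                   # [1, 1, 1, head_dim]
                k[None, None], v[None, None])              # [1, 1, len, head_dim]
            outs.append(o.squeeze(0))
        return torch.cat(outs, dim=0)                      # [n_kv_heads, 1, head_dim]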