HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
February 18, 2025
Authors: Cheng Luo, Zefan Cai, Hanshi Sun, Jinqi Xiao, Bo Yuan, Wen Xiao, Junjie Hu, Jiawei Zhao, Beidi Chen, Anima Anandkumar
cs.AI
Abstract
Transformer-based large language models (LLMs) demonstrate impressive
performance in long context generation. Extending the context length has
disproportionately shifted the memory footprint of LLMs during inference to the
key-value cache (KV cache). In this paper, we propose HEADINFER, which offloads
the KV cache to CPU RAM while avoiding the need to fully store the KV cache for
any transformer layer on the GPU. HEADINFER employs a fine-grained, head-wise
offloading strategy, keeping only selected attention heads' KV cache on the
GPU while computing attention output dynamically. Through roofline analysis, we
demonstrate that HEADINFER maintains computational efficiency while
significantly reducing memory footprint. We evaluate HEADINFER on the
Llama-3-8B model with a 1-million-token sequence, reducing the GPU memory
footprint of the KV cache from 128 GB to 1 GB and the total GPU memory usage
from 207 GB to 17 GB, achieving a 92% reduction compared to BF16 baseline
inference. Notably, HEADINFER enables 4-million-token inference with an 8B
model on a single consumer GPU with 24GB memory (e.g., NVIDIA RTX 4090) without
approximation methods.
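
To make the headline numbers concrete, the back-of-the-envelope calculation below reproduces the 128 GB figure from Llama-3-8B's public configuration (32 layers, 8 grouped-query KV heads, head dimension 128, BF16). The roughly 1 GB on-GPU figure reported in the abstract is of the same order as keeping a single head's cache resident; the exact on-GPU budget depends on buffering details the abstract does not state.

    # Llama-3-8B public config: 32 layers, 8 KV heads (GQA), head_dim 128, BF16 (2 bytes).
    LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2
    SEQ_LEN = 1 << 20                                            # ~1 million tokens

    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES         # K and V: 128 KiB per token
    full_cache = per_token * SEQ_LEN                             # 2**37 bytes = 128 GiB
    per_head_per_layer = full_cache / (LAYERS * KV_HEADS)        # 0.5 GiB

    print(f"full KV cache (baseline, on GPU): {full_cache / 2**30:.0f} GiB")
    print(f"one head of one layer           : {per_head_per_layer / 2**30:.1f} GiB")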
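The abstract describes the mechanism only at a high level, so the following is a minimal PyTorch sketch of what head-wise offloading could look like during decoding, not the authors' implementation. The names HeadwiseKVCache, append, head_to_gpu, and attend_headwise are hypothetical; the sketch keeps the full cache in pinned CPU memory, stages one head's K/V onto the GPU at a time, and for brevity ignores grouped-query attention and the transfer/compute overlap a real system would need to retain computational efficiency.

    import torch
    import torch.nn.functional as F

    class HeadwiseKVCache:
        """Hypothetical per-head KV cache kept in pinned CPU memory."""

        def __init__(self, n_kv_heads, max_len, head_dim, dtype=torch.bfloat16):
            self.k = torch.empty(n_kv_heads, max_len, head_dim, dtype=dtype, pin_memory=True)
            self.v = torch.empty(n_kv_heads, max_len, head_dim, dtype=dtype, pin_memory=True)
            self.len = 0

        def append(self, k_new, v_new):
            # k_new, v_new: [n_kv_heads, t, head_dim] on the GPU; write back to host memory.
            t = k_new.shape[1]
            self.k[:, self.len:self.len + t].copy_(k_new, non_blocking=True)
            self.v[:, self.len:self.len + t].copy_(v_new, non_blocking=True)
            self.len += t

        def head_to_gpu(self, h, device="cuda"):
            # Stage a single head's K/V onto the GPU; this is the only KV resident there.
            k = self.k[h, :self.len].to(device, non_blocking=True)
            v = self.v[h, :self.len].to(device, non_blocking=True)
            return k, v

    def attend_headwise(q, cache):
        # q: [n_kv_heads, 1, head_dim] query of the newest token, already on the GPU.
        outs = []
        for h in range(q.shape[0]):
            k, v = cache.head_to_gpu(h)                    # [len, head_dim]
            o = F.scaled_dot_product_attention(
                q[h:h + 1].unsqueeze(0),                   # [1, 1, 1, head_dim]
                k[None, None], v[None, None])              # [1, 1, len, head_dim]
            outs.append(o.squeeze(0))
        return torch.cat(outs, dim=0)                      # [n_kv_heads, 1, head_dim]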