PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
December 16, 2023
Authors: Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen
cs.AI
Abstract
This paper introduces PowerInfer, a high-speed Large Language Model (LLM)
inference engine on a personal computer (PC) equipped with a single
consumer-grade GPU. The key insight underlying the design of PowerInfer is exploiting
the high locality inherent in LLM inference, characterized by a power-law
distribution in neuron activation. This distribution indicates that a small
subset of neurons, termed hot neurons, are consistently activated across
inputs, while the majority, cold neurons, vary based on specific inputs.
PowerInfer exploits this insight to design a GPU-CPU hybrid inference
engine: hot-activated neurons are preloaded onto the GPU for fast access, while
cold-activated neurons are computed on the CPU, thus significantly reducing GPU
memory demands and CPU-GPU data transfers. PowerInfer further integrates
adaptive predictors and neuron-aware sparse operators, optimizing the
efficiency of neuron activation and computational sparsity. Evaluation shows
that PowerInfer attains an average token generation rate of 13.20 tokens/s,
with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a
single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier
server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x
while retaining model accuracy.
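To make the mechanism concrete, here is a minimal NumPy sketch of the scheme the abstract describes: neurons are split into a GPU-resident hot set and a CPU-resident cold set based on a hypothetical offline activation-frequency profile, and a cheap low-rank predictor selects which neurons to compute for each input. Every name, size, and threshold below is an illustrative assumption; the predictor is random rather than trained, and both "devices" are simulated on the CPU, so treat this as a sketch of the idea, not PowerInfer's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy FFN layer: each row of W is one "neuron". Sizes are illustrative.
n_neurons, d_model = 4096, 1024
W = rng.standard_normal((n_neurons, d_model)).astype(np.float32)

# Hypothetical offline profile: how often each neuron fired on a calibration
# set. A Pareto draw mimics the power-law skew the abstract describes.
activation_freq = rng.pareto(a=1.5, size=n_neurons)
activation_freq /= activation_freq.max()

# Hot/cold split: pin the most frequently activated neurons to the GPU
# partition (always resident, fast access); the long tail stays CPU-side.
gpu_budget = int(0.2 * n_neurons)              # e.g. 20% of rows fit in VRAM
hot = np.sort(np.argsort(activation_freq)[-gpu_budget:])
cold = np.setdiff1d(np.arange(n_neurons), hot)

# Low-rank stand-in for the adaptive activation predictors. The real
# predictors are small trained networks; random projections suffice here.
r = 32
P1 = rng.standard_normal((r, d_model)).astype(np.float32)
P2 = rng.standard_normal((n_neurons, r)).astype(np.float32)

def predict_active(x: np.ndarray, keep: float = 0.1) -> np.ndarray:
    """Cheaply score all neurons (O(r*(d+n)) vs. O(n*d) for the full layer)
    and keep only the top `keep` fraction for exact computation."""
    score = P2 @ (P1 @ x)
    return score >= np.quantile(score, 1.0 - keep)

def hybrid_ffn(x: np.ndarray) -> np.ndarray:
    """Compute only predicted-active neurons, each on its 'home' partition.
    Here both paths run on the CPU; in PowerInfer the hot path would be a
    GPU kernel and the cold path a CPU-side sparse operator."""
    active = predict_active(x)
    out = np.zeros(n_neurons, dtype=np.float32)
    hot_active = hot[active[hot]]              # "GPU" work: small, resident
    cold_active = cold[active[cold]]           # "CPU" work: sparse, rare
    out[hot_active] = W[hot_active] @ x
    out[cold_active] = W[cold_active] @ x
    return np.maximum(out, 0.0)                # ReLU zeroes inactive neurons

x = rng.standard_normal(d_model).astype(np.float32)
y = hybrid_ffn(x)
print(f"hot rows computed:  {hot[predict_active(x)[hot]].size}")
print(f"cold rows computed: {cold[predict_active(x)[cold]].size}")
```

The point of the split, per the abstract, is that the hot rows amortize their VRAM cost across nearly every token, while cold rows fire rarely enough that computing them with CPU sparse matvecs is cheaper than holding them in GPU memory or paying CPU-GPU transfer costs for each token.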