PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
December 16, 2023
Authors: Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen
cs.AI
Abstract
This paper introduces PowerInfer, a high-speed Large Language Model (LLM)
inference engine on a personal computer (PC) equipped with a single
consumer-grade GPU. The key insight underlying PowerInfer's design is exploiting
the high locality inherent in LLM inference, characterized by a power-law
distribution in neuron activation. This distribution indicates that a small
subset of neurons, termed hot neurons, are consistently activated across
inputs, while the majority, cold neurons, vary based on specific inputs.
PowerInfer exploits this insight to design a GPU-CPU hybrid inference
engine: hot-activated neurons are preloaded onto the GPU for fast access, while
cold-activated neurons are computed on the CPU, thus significantly reducing GPU
memory demands and CPU-GPU data transfers. PowerInfer further integrates
adaptive predictors and neuron-aware sparse operators, optimizing the
efficiency of neuron activation and computational sparsity. Evaluation shows
that PowerInfer attains an average token generation rate of 13.20 tokens/s,
with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a
single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier
server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x
while retaining model accuracy.
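
To make the hybrid scheme concrete, here is a minimal NumPy sketch of the hot/cold partitioning the abstract describes. This is not PowerInfer's actual implementation (which builds on llama.cpp's C++ code): the 20% GPU budget, the Zipf-distributed activation frequencies, and the thresholding predictor are all illustrative assumptions standing in for the paper's offline profiling and adaptive predictors.

```python
# Sketch of GPU-CPU hybrid inference with hot/cold neuron partitioning.
# All numbers and the predictor rule below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_neurons, d_model = 4096, 1024

# Offline profiling (assumed): per-neuron activation frequency drawn from a
# power law, mirroring the skewed activation distribution the paper observes.
freq = rng.zipf(a=2.0, size=n_neurons).astype(float)
freq /= freq.max()

gpu_budget = int(0.2 * n_neurons)          # assumed GPU memory budget
hot_idx = np.argsort(freq)[-gpu_budget:]   # hot neurons, preloaded to the GPU
cold_idx = np.setdiff1d(np.arange(n_neurons), hot_idx)  # cold neurons, on CPU

W = rng.standard_normal((n_neurons, d_model))  # rows = neurons of one FFN layer

def hybrid_ffn(x, active):
    """Compute only the neurons a predictor marks active, routing hot rows
    to the GPU-resident weights and cold rows to the CPU-resident weights."""
    out = np.zeros(n_neurons)
    hot_active = np.intersect1d(active, hot_idx)
    cold_active = np.intersect1d(active, cold_idx)
    # In a real engine these two sparse matvecs run on GPU and CPU in
    # parallel; here both use the same in-memory array for simplicity.
    out[hot_active] = W[hot_active] @ x
    out[cold_active] = W[cold_active] @ x
    return out

# Per-input activation prediction (assumed): hot neurons fire often, cold rarely.
active = np.nonzero(rng.random(n_neurons) < 0.05 + 0.5 * freq)[0]
y = hybrid_ffn(rng.standard_normal(d_model), active)
print(f"computed {active.size}/{n_neurons} neurons; "
      f"{np.intersect1d(active, hot_idx).size} on GPU, rest on CPU")
```

Because the hot set is small but covers most activations, the bulk of each token's work stays on the GPU while the CPU handles the long tail, which is what lets a single consumer-grade card approach server-grade throughput.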