PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
December 16, 2023
Authors: Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen
cs.AI
Abstract
This paper introduces PowerInfer, a high-speed Large Language Model (LLM)
inference engine on a personal computer (PC) equipped with a single
consumer-grade GPU. The key insight underlying PowerInfer's design is exploiting
the high locality inherent in LLM inference, characterized by a power-law
distribution in neuron activation. This distribution indicates that a small
subset of neurons, termed hot neurons, are consistently activated across
inputs, while the majority, cold neurons, vary based on specific inputs.
PowerInfer exploits this insight to design a GPU-CPU hybrid inference
engine: hot-activated neurons are preloaded onto the GPU for fast access, while
cold-activated neurons are computed on the CPU, thus significantly reducing GPU
memory demands and CPU-GPU data transfers. PowerInfer further integrates
adaptive predictors and neuron-aware sparse operators, optimizing the
efficiency of neuron activation and computational sparsity. Evaluation shows
that PowerInfer attains an average token generation rate of 13.20 tokens/s,
with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a
single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier
server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x
while retaining model accuracy.
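
To make the hybrid scheme concrete, here is a minimal NumPy sketch of the hot/cold partitioning the abstract describes. This is not PowerInfer's actual implementation (which builds on llama.cpp's C++ code): the 20% GPU budget, the Zipf-distributed activation frequencies, and the thresholding predictor are all illustrative assumptions standing in for the paper's offline profiling and adaptive predictors.

```python
# Sketch of GPU-CPU hybrid inference with hot/cold neuron partitioning.
# All numbers and the predictor rule below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_neurons, d_model = 4096, 1024

# Offline profiling (assumed): per-neuron activation frequency drawn from a
# power law, mirroring the skewed activation distribution the paper observes.
freq = rng.zipf(a=2.0, size=n_neurons).astype(float)
freq /= freq.max()

gpu_budget = int(0.2 * n_neurons)          # assumed GPU memory budget
hot_idx = np.argsort(freq)[-gpu_budget:]   # hot neurons, preloaded to the GPU
cold_idx = np.setdiff1d(np.arange(n_neurons), hot_idx)  # cold neurons, on CPU

W = rng.standard_normal((n_neurons, d_model))  # rows = neurons of one FFN layer

def hybrid_ffn(x, active):
    """Compute only the neurons a predictor marks active, routing hot rows
    to the GPU-resident weights and cold rows to the CPU-resident weights."""
    out = np.zeros(n_neurons)
    hot_active = np.intersect1d(active, hot_idx)
    cold_active = np.intersect1d(active, cold_idx)
    # In a real engine these two sparse matvecs run on GPU and CPU in
    # parallel; here both use the same in-memory array for simplicity.
    out[hot_active] = W[hot_active] @ x
    out[cold_active] = W[cold_active] @ x
    return out

# Per-input activation prediction (assumed): hot neurons fire often, cold rarely.
active = np.nonzero(rng.random(n_neurons) < 0.05 + 0.5 * freq)[0]
y = hybrid_ffn(rng.standard_normal(d_model), active)
print(f"computed {active.size}/{n_neurons} neurons; "
      f"{np.intersect1d(active, hot_idx).size} on GPU, rest on CPU")
```

Because the hot set is small but covers most activations, the bulk of each token's work stays on the GPU while the CPU handles the long tail, which is what lets a single consumer-grade card approach server-grade throughput.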