PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
December 16, 2023
Authors: Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen
cs.AI
Abstract
This paper introduces PowerInfer, a high-speed Large Language Model (LLM)
inference engine on a personal computer (PC) equipped with a single
consumer-grade GPU. The key insight underlying the design of PowerInfer is exploiting
the high locality inherent in LLM inference, characterized by a power-law
distribution in neuron activation. This distribution indicates that a small
subset of neurons, termed hot neurons, are consistently activated across
inputs, while the majority, cold neurons, vary based on specific inputs.
PowerInfer exploits this insight to design a GPU-CPU hybrid inference
engine: hot-activated neurons are preloaded onto the GPU for fast access, while
cold-activated neurons are computed on the CPU, thus significantly reducing GPU
memory demands and CPU-GPU data transfers. PowerInfer further integrates
adaptive predictors and neuron-aware sparse operators, optimizing the
efficiency of neuron activation and computational sparsity. Evaluation shows
that PowerInfer attains an average token generation rate of 13.20 tokens/s,
with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a
single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier
server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x
while retaining model accuracy.
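To make the mechanism concrete, here is a minimal NumPy sketch of the scheme the abstract describes: neurons are split into a GPU-resident hot set and a CPU-resident cold set based on a hypothetical offline activation-frequency profile, and a cheap low-rank predictor selects which neurons to compute for each input. Every name, size, and threshold below is an illustrative assumption; the predictor is random rather than trained, and both "devices" are simulated on the CPU, so treat this as a sketch of the idea, not PowerInfer's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy FFN layer: each row of W is one "neuron". Sizes are illustrative.
n_neurons, d_model = 4096, 1024
W = rng.standard_normal((n_neurons, d_model)).astype(np.float32)

# Hypothetical offline profile: how often each neuron fired on a calibration
# set. A Pareto draw mimics the power-law skew the abstract describes.
activation_freq = rng.pareto(a=1.5, size=n_neurons)
activation_freq /= activation_freq.max()

# Hot/cold split: pin the most frequently activated neurons to the GPU
# partition (always resident, fast access); the long tail stays CPU-side.
gpu_budget = int(0.2 * n_neurons)              # e.g. 20% of rows fit in VRAM
hot = np.sort(np.argsort(activation_freq)[-gpu_budget:])
cold = np.setdiff1d(np.arange(n_neurons), hot)

# Low-rank stand-in for the adaptive activation predictors. The real
# predictors are small trained networks; random projections suffice here.
r = 32
P1 = rng.standard_normal((r, d_model)).astype(np.float32)
P2 = rng.standard_normal((n_neurons, r)).astype(np.float32)

def predict_active(x: np.ndarray, keep: float = 0.1) -> np.ndarray:
    """Cheaply score all neurons (O(r*(d+n)) vs. O(n*d) for the full layer)
    and keep only the top `keep` fraction for exact computation."""
    score = P2 @ (P1 @ x)
    return score >= np.quantile(score, 1.0 - keep)

def hybrid_ffn(x: np.ndarray) -> np.ndarray:
    """Compute only predicted-active neurons, each on its 'home' partition.
    Here both paths run on the CPU; in PowerInfer the hot path would be a
    GPU kernel and the cold path a CPU-side sparse operator."""
    active = predict_active(x)
    out = np.zeros(n_neurons, dtype=np.float32)
    hot_active = hot[active[hot]]              # "GPU" work: small, resident
    cold_active = cold[active[cold]]           # "CPU" work: sparse, rare
    out[hot_active] = W[hot_active] @ x
    out[cold_active] = W[cold_active] @ x
    return np.maximum(out, 0.0)                # ReLU zeroes inactive neurons

x = rng.standard_normal(d_model).astype(np.float32)
y = hybrid_ffn(x)
print(f"hot rows computed:  {hot[predict_active(x)[hot]].size}")
print(f"cold rows computed: {cold[predict_active(x)[cold]].size}")
```

The point of the split, per the abstract, is that the hot rows amortize their VRAM cost across nearly every token, while cold rows fire rarely enough that computing them with CPU sparse matvecs is cheaper than holding them in GPU memory or paying CPU-GPU transfer costs for each token.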