PowerInfer-2: Fast Large Language Model Inference on a Smartphone
June 10, 2024
Authors: Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen
cs.AI
Abstract
This paper introduces PowerInfer-2, a framework designed for high-speed
inference of Large Language Models (LLMs) on smartphones, particularly
effective for models whose sizes exceed the device's memory capacity. The key
insight of PowerInfer-2 is to utilize the heterogeneous computation, memory,
and I/O resources in smartphones by decomposing traditional matrix computations
into fine-grained neuron cluster computations. Specifically, PowerInfer-2
features a polymorphic neuron engine that adapts computational strategies for
various stages of LLM inference. Additionally, it introduces segmented neuron
caching and fine-grained neuron-cluster-level pipelining, which effectively
minimize and conceal the overhead caused by I/O operations. The implementation
and evaluation of PowerInfer-2 demonstrate its capability to support a wide
array of LLM models on two smartphones, achieving up to a 29.2x speed increase
compared with state-of-the-art frameworks. Notably, PowerInfer-2 is the first
system to serve the TurboSparse-Mixtral-47B model with a generation rate of
11.68 tokens per second on a smartphone. For models that fit entirely within
the memory, PowerInfer-2 can achieve approximately a 40% reduction in memory
usage while maintaining inference speeds comparable to llama.cpp and MLC-LLM.
For more details, including a demonstration video, please visit the project
site at www.powerinfer.ai/v2.
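The fine-grained neuron-cluster-level pipelining described above can be pictured as a producer-consumer overlap: an I/O thread prefetches the next neuron clusters from storage while the compute thread processes the clusters already loaded, so I/O latency is hidden behind computation. Below is a minimal illustrative sketch of this overlap; the names `run_pipeline`, `load`, and `compute` are hypothetical stand-ins, not PowerInfer-2's actual API.

```python
import threading
import queue

def run_pipeline(clusters, load, compute, depth=2):
    """Overlap cluster loading (I/O) with computation via a bounded queue.

    `depth` bounds how many loaded-but-unprocessed clusters may be in
    flight, modeling a small prefetch window (a hypothetical parameter).
    """
    q = queue.Queue(maxsize=depth)
    results = []

    def loader():
        # Producer: fetch clusters ahead of the consumer; blocks when
        # the prefetch window is full.
        for c in clusters:
            q.put(load(c))
        q.put(None)  # sentinel: no more clusters

    t = threading.Thread(target=loader)
    t.start()
    # Consumer: compute on each cluster as soon as it is available,
    # while the loader keeps fetching the next ones in the background.
    while (item := q.get()) is not None:
        results.append(compute(item))
    t.join()
    return results

# Toy stand-ins for a flash read and per-cluster computation.
loaded = run_pipeline(range(4), load=lambda c: c * 10, compute=lambda w: w + 1)
# loaded == [1, 11, 21, 31]
```

In the real system the producer would issue flash reads for sparsely activated neuron clusters and the consumer would run the corresponding matrix-fragment computations; the toy lambdas above only demonstrate the overlapping structure.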