PowerInfer-2: Fast Large Language Model Inference on a Smartphone

Abstract

This paper introduces PowerInfer-2, a framework for fast inference of Large Language Models (LLMs) on smartphones that is particularly effective for models whose size exceeds the device's memory capacity. The key insight of PowerInfer-2 is to exploit the heterogeneous computation, memory, and I/O resources of a smartphone by decomposing traditional matrix computations into fine-grained neuron cluster computations. Specifically, PowerInfer-2 features a polymorphic neuron engine that adapts its computational strategy to the different stages of LLM inference. It also introduces segmented neuron caching and fine-grained neuron-cluster-level pipelining, which minimize and hide the overhead of I/O operations. The implementation and evaluation of PowerInfer-2 show that it supports a wide range of LLMs on two smartphones and achieves up to a 29.2x speedup over state-of-the-art frameworks. Notably, PowerInfer-2 is the first system to serve the TurboSparse-Mixtral-47B model on a smartphone, at a generation rate of 11.68 tokens per second. For models that fit entirely in memory, PowerInfer-2 reduces memory usage by approximately 40% while maintaining inference speed comparable to llama.cpp and MLC-LLM. For more details, including a demonstration video, visit the project site at www.powerinfer.ai/v2.
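
The decomposition idea can be illustrated with a toy sketch. It is not PowerInfer-2's implementation: the contiguous clustering policy, the `NeuronClusterCache` class, and the plain LRU eviction (rather than the segmented cache the abstract describes) are all assumptions made for illustration. It only shows that a matrix-vector product can be computed one neuron cluster at a time, with each cluster fetched through a cache that stands in for flash I/O.

```python
from collections import OrderedDict


def matvec(weights, x):
    """Dense reference: one output value per neuron (one row of weights)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def split_into_clusters(num_neurons, cluster_size):
    """Group neuron indices into contiguous clusters (hypothetical policy)."""
    return [list(range(start, min(start + cluster_size, num_neurons)))
            for start in range(0, num_neurons, cluster_size)]


class NeuronClusterCache:
    """Toy LRU cache of neuron clusters (assumption, not the paper's design)."""

    def __init__(self, storage, capacity):
        self.storage = storage        # simulated flash: cluster id -> weight rows
        self.capacity = capacity      # maximum number of clusters kept in memory
        self.cached = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, cluster_id):
        if cluster_id in self.cached:
            self.hits += 1
            self.cached.move_to_end(cluster_id)      # mark as most recently used
        else:
            self.misses += 1                         # a real system would issue I/O here
            self.cached[cluster_id] = self.storage[cluster_id]
            if len(self.cached) > self.capacity:
                self.cached.popitem(last=False)      # evict least recently used
        return self.cached[cluster_id]


def clustered_matvec(cache, clusters, x):
    """Compute the same product as matvec, one neuron cluster at a time."""
    out = [0.0] * sum(len(ids) for ids in clusters)
    for cluster_id, neuron_ids in enumerate(clusters):
        rows = cache.get(cluster_id)
        for neuron_id, row in zip(neuron_ids, rows):
            out[neuron_id] = sum(w * v for w, v in zip(row, x))
    return out


# Five neurons with two inputs each, grouped into clusters of two neurons.
weights = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [2.0, 2.0], [-1.0, 1.0]]
x = [2.0, 4.0]
clusters = split_into_clusters(len(weights), cluster_size=2)
storage = {cid: [weights[n] for n in ids] for cid, ids in enumerate(clusters)}
cache = NeuronClusterCache(storage, capacity=3)

first = clustered_matvec(cache, clusters, x)    # every cluster is loaded (cache misses)
second = clustered_matvec(cache, clusters, x)   # every cluster is served from the cache
```

Because each cluster is fetched and computed independently, a scheduler could in principle load one cluster while another is being computed, which is the kind of overlap that neuron-cluster-level pipelining aims for.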
