LeanK: Learnable K Cache Channel Pruning for Efficient Decoding

Abstract

Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. Using a novel two-stage training process, LeanK learns a channel-wise static mask that satisfies a specified sparsity ratio and hardware alignment requirements. LeanK reduces GPU memory usage and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%-18% V cache memory reduction. A custom decoding kernel enables a 1.3x speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at https://aka.ms/LeanK.
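The idea of a static channel-wise mask over the K cache can be illustrated with a small sketch. This is not the LeanK implementation: the function names are hypothetical, the per-channel importance score is a simple mean-magnitude heuristic standing in for the scores LeanK learns through its two-stage training, and rounding the kept-channel count up to a multiple of `align` is only an assumed reading of the hardware alignment requirement. The sketch shows that once a fixed set of channels is dropped from both the cached keys and the query, the attention logits can be computed over the remaining channels alone.

```python
import math
import random


def aligned_keep_count(head_dim, sparsity, align):
    """Channels to keep for a target sparsity, rounded up to a multiple of `align`."""
    keep = head_dim * (1.0 - sparsity)
    keep = int(math.ceil(keep / align)) * align
    return min(head_dim, max(align, keep))


def static_channel_mask(importance, sparsity, align):
    """Boolean keep-mask over channels: keep the highest-scoring ones."""
    n_keep = aligned_keep_count(len(importance), sparsity, align)
    ranked = sorted(range(len(importance)), key=lambda c: importance[c], reverse=True)
    kept = set(ranked[:n_keep])
    return [c in kept for c in range(len(importance))]


def prune_channels(vec, mask):
    """Drop the masked-out channels from a single key or query vector."""
    return [x for x, keep in zip(vec, mask) if keep]


def attention_logits(query, keys, head_dim):
    """Scaled dot products q.k / sqrt(head_dim) of one query against all keys."""
    scale = 1.0 / math.sqrt(head_dim)
    return [scale * sum(q * k for q, k in zip(query, key)) for key in keys]


random.seed(0)
head_dim, seq_len = 16, 32

# Toy K cache for one head: the last 8 channels carry almost no signal.
channel_scale = [1.0] * 8 + [1e-3] * 8
k_cache = [[random.gauss(0, 1) * s for s in channel_scale] for _ in range(seq_len)]
query = [random.gauss(0, 1) for _ in range(head_dim)]

# Stand-in importance score (assumption): mean absolute value per channel.
importance = [sum(abs(row[c]) for row in k_cache) / seq_len for c in range(head_dim)]
mask = static_channel_mask(importance, sparsity=0.5, align=8)

# The mask is static, so the cache can be stored in pruned form;
# the query is pruned with the same mask at decode time.
pruned_k = [prune_channels(row, mask) for row in k_cache]
pruned_q = prune_channels(query, mask)

full = attention_logits(query, k_cache, head_dim)
approx = attention_logits(pruned_q, pruned_k, head_dim)
max_err = max(abs(a - b) for a, b in zip(full, approx))
print("kept channels:", sum(mask), "of", head_dim)
print("max logit error:", max_err)
```

Because the mask does not depend on the input, the pruned cache holds fewer values per token, which is where the memory saving would come from. In this toy setup the logits change very little only because the dropped channels were constructed to be nearly zero.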
