Draft-based Approximate Inference for LLMs

Abstract

Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a novel framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Specifically, we introduce two instantiations of our proposed framework: (i) SpecKV, which leverages a draft output to accurately assess the importance of each KV pair for more effective KV cache dropping, and (ii) SpecPC, which uses the draft model's attention activations to identify and discard unimportant prompt tokens. To the best of our knowledge, this is the first work to use draft models for approximate LLM inference acceleration, extending their utility beyond traditional lossless speculative decoding. We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput. Our code is available at https://github.com/furiosa-ai/draft-based-approx-llm.
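As a rough illustration of the idea behind SpecPC, the sketch below scores each prompt token by the attention it receives from a draft model and keeps only the highest-scoring tokens. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `select_prompt_tokens`, the use of the maximum over draft queries as the importance score, and the always-kept `recent_window` are all hypothetical choices made for this example, and the attention matrix is hand-written rather than taken from a real model.

```python
from typing import List, Sequence


def select_prompt_tokens(
    draft_attention: Sequence[Sequence[float]],
    keep: int,
    recent_window: int = 0,
) -> List[int]:
    """Return the sorted indices of the prompt tokens to keep.

    draft_attention[q][t] is the attention weight that draft query q puts
    on prompt token t (hypothetical input, e.g. already averaged over heads).
    """
    if not draft_attention:
        raise ValueError("draft_attention must contain at least one query row")
    num_tokens = len(draft_attention[0])
    if any(len(row) != num_tokens for row in draft_attention):
        raise ValueError("all rows must have the same length")
    keep = max(0, min(keep, num_tokens))

    # Importance of a token = the largest weight any draft query gave it
    # (an assumption of this sketch; a mean would also be a plausible choice).
    scores = [max(row[t] for row in draft_attention) for t in range(num_tokens)]

    # Always keep the most recent tokens (also an assumption of this sketch).
    recent_window = max(0, min(recent_window, keep))
    forced = set(range(num_tokens - recent_window, num_tokens))

    # Fill the remaining budget with the highest-scoring other tokens.
    candidates = [t for t in range(num_tokens) if t not in forced]
    candidates.sort(key=lambda t: scores[t], reverse=True)
    chosen = forced | set(candidates[: keep - len(forced)])
    return sorted(chosen)


# Toy example: 2 draft queries attending over 5 prompt tokens.
attention = [
    [0.05, 0.60, 0.05, 0.10, 0.20],
    [0.10, 0.10, 0.50, 0.10, 0.20],
]
kept = select_prompt_tokens(attention, keep=3, recent_window=1)
print(kept)
```

In this toy input, tokens 1 and 2 receive the largest draft attention and the last token is kept by the window, so the other two tokens would be discarded before the prompt reaches the target model. A KV cache dropping variant in the spirit of SpecKV could reuse the same scoring step on KV pairs instead of prompt tokens, though the abstract does not specify those details.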
