TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Abstract

With large language models (LLMs) now widely deployed for long content generation, demand for efficient long-sequence inference has grown. However, the key-value (KV) cache, which is stored to avoid re-computation, grows linearly in size with the sequence length and has become a critical bottleneck. Because LLMs are auto-regressive, the entire KV cache is loaded for every generated token, which results in low utilization of computational cores and high latency. Various KV cache compression methods have been proposed to alleviate this issue, but they degrade generation quality.

We introduce TriForce, a hierarchical speculative decoding system that scales to long sequence generation. The approach uses the original model weights together with a dynamic sparse KV cache, built via retrieval, as a draft model. This draft model serves as an intermediate layer in the hierarchy and is itself speculated by a smaller model to reduce its drafting latency.

TriForce achieves speedups of up to 2.31× for Llama2-7B-128K on an A100 GPU and also scales to even longer contexts. In the offloading setting on two RTX 4090 GPUs, TriForce achieves 0.108 s/token, only half as slow as the auto-regressive baseline on an A100, and a 7.78× speedup on our optimized offloading system. On a single RTX 4090 GPU, TriForce is 4.86× faster than DeepSpeed-Zero-Inference. TriForce's robustness is shown by its consistently strong performance across various temperatures. The code is available at https://github.com/Infini-AI-Lab/TriForce.
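The hierarchy described in the abstract can be illustrated with a toy sketch. It is not the TriForce implementation: the three "models" are hypothetical deterministic functions over integer tokens, and verification uses simple greedy matching, a rule the abstract does not specify. In TriForce the middle tier is the original model's weights with a retrieval-based sparse KV cache; here it is only a stand-in that sometimes disagrees with the target. The sketch shows why this kind of scheme is lossless: whatever the drafters propose, the final output matches plain decoding with the target.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]
# A "drafter" proposes k tokens that continue a prefix.
Drafter = Callable[[List[int], int], List[int]]


def autoregressive(model: Model, prefix: List[int], n: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    out = list(prefix)
    for _ in range(n):
        out.append(model(out))
    return out[len(prefix):]


def speculative(target: Model, drafter: Drafter, prefix: List[int],
                n: int, gamma: int = 4) -> List[int]:
    """Greedy speculative decoding: draft gamma tokens, then verify with target."""
    out = list(prefix)
    while len(out) - len(prefix) < n:
        proposal = drafter(out, gamma)
        for tok in proposal:
            expected = target(out)   # a real system scores all drafts in one pass
            out.append(expected)     # keep the target's token either way
            if expected != tok:      # first mismatch: discard the rest of the draft
                break
        else:
            out.append(target(out))  # every draft accepted: one bonus token
    return out[len(prefix):len(prefix) + n]


# Hypothetical toy models (assumptions for illustration only).
def target(ctx: List[int]) -> int:
    return (31 * sum(ctx) + len(ctx)) % 50


def mid(ctx: List[int]) -> int:
    # Stand-in for "full weights + sparse KV cache": mostly agrees with target.
    return (target(ctx) + 1) % 50 if len(ctx) % 5 == 0 else target(ctx)


def small(ctx: List[int]) -> int:
    # Stand-in for the smaller model at the bottom of the hierarchy.
    return (target(ctx) + 2) % 50 if len(ctx) % 3 == 0 else target(ctx)


def small_drafter(ctx: List[int], k: int) -> List[int]:
    return autoregressive(small, ctx, k)


def mid_drafter(ctx: List[int], k: int) -> List[int]:
    # Tier 1: the small model drafts for the middle tier.
    return speculative(mid, small_drafter, ctx, k, gamma=2)


def hierarchical_decode(prompt: List[int], n: int) -> List[int]:
    # Tier 2: the middle tier drafts for the target.
    return speculative(target, mid_drafter, prompt, n, gamma=4)


prompt = [1, 2, 3]
assert hierarchical_decode(prompt, 12) == autoregressive(target, prompt, 12)
```

The draft lengths (`gamma=2` and `gamma=4`) are arbitrary choices for the sketch. In a real system the speedup depends on how often drafts are accepted and on how much cheaper each tier is than the one above it, neither of which this toy measures.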

