Speculative Streaming: Fast LLM Inference without Auxiliary Models

Abstract

Speculative decoding is a prominent technique for speeding up inference of a large target language model using predictions from an auxiliary draft model. While effective, in application-specific settings it often requires fine-tuning both the draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose Speculative Streaming, a single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next-token prediction to future n-gram prediction. Speculative Streaming speeds up decoding by 1.8-3.1X across a diverse set of tasks, such as Summarization, Structured Queries, and Meaning Representation, without sacrificing generation quality. Additionally, Speculative Streaming is parameter-efficient: it achieves on-par or higher speed-ups than Medusa-style architectures while using ~10,000X fewer extra parameters, making it well suited to resource-constrained devices.
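To make the draft-then-verify idea concrete, here is a minimal sketch of the greedy verification step used in generic speculative decoding. It is not the Speculative Streaming implementation: the names `verify_draft` and `toy_target` are hypothetical, the target model is replaced by a toy function, and the per-token calls stand in for what a real system would do in a single batched forward pass. In Speculative Streaming, the drafted tokens would come from the target model itself rather than from a separate draft model.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Greedy verification (illustrative assumption, not the paper's code).

    Accepts drafted tokens for as long as they match the target model's
    greedy choice. On the first mismatch, the target's own token is emitted
    instead and verification stops. If every drafted token is accepted, one
    extra token from the target is appended.
    """
    context = list(prefix)
    accepted: List[int] = []
    for token in draft:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)  # correction token from the target
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_next(context))  # bonus token after full acceptance
    return accepted


def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for a target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


if __name__ == "__main__":
    # The first two drafted tokens match the toy target; the third does not.
    print(verify_draft([1, 2], [3, 4, 9], toy_target))
```

Each verification step always yields at least one token, so the output matches what greedy decoding with the target alone would produce; the speed-up comes from accepting several drafted tokens per step when the draft is accurate.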
