Cascade Speculative Drafting for Even Faster LLM Inference

Abstract

Speculative decoding improves the efficiency of large language models (LLMs) by having a smaller draft model propose tokens that a larger target model then reviews. However, drafting in speculative decoding relies on slow autoregressive generation and spends the same amount of time on every token regardless of its importance. These two inefficiencies lead to suboptimal performance. To address this issue, we introduce Cascade Speculative Drafting (CS. Drafting), a novel approach that employs two types of cascades. The Vertical Cascade eliminates autoregressive generation from neural models. The Horizontal Cascade allocates drafting time efficiently, and its optimality is supported by our theoretical analysis. Combining both cascades, our CS. Drafting algorithm achieves up to 72 percent additional speedup over speculative decoding in our experiments while keeping the same output distribution.
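To make the draft-then-review idea concrete, here is a minimal sketch of a single review step over a toy vocabulary. It assumes the standard speculative sampling acceptance rule (accept a drafted token with probability min(1, p/q), otherwise resample from the renormalized residual), which is what keeps the output distribution equal to the target's. The function name `speculative_step` and the toy distributions are illustrative assumptions, not taken from the paper, and the sketch does not implement either cascade.

```python
import random

def speculative_step(p, q, rng):
    """One draft-then-review step (hypothetical helper, toy vocabulary).

    p: target-model probabilities, q: draft-model probabilities.
    Both are lists over the same vocabulary and each sums to 1.
    Returns (token, accepted).
    """
    vocab = list(range(len(p)))
    # The draft model proposes a token by sampling from q.
    x = rng.choices(vocab, weights=q)[0]
    # The target model accepts the proposal with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the residual max(0, p - q);
    # rng.choices normalizes the weights internally.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    x = rng.choices(vocab, weights=residual)[0]
    return x, False

if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.6, 0.3, 0.1]  # assumed target distribution
    q = [0.2, 0.5, 0.3]  # assumed draft distribution
    n = 20000
    counts = [0] * len(p)
    for _ in range(n):
        token, _accepted = speculative_step(p, q, rng)
        counts[token] += 1
    # Empirical frequencies should approach p, not q.
    print([round(c / n, 3) for c in counts])
```

Under this rule the sampled tokens follow the target distribution p even though proposals come from q, so a better draft model changes only how often proposals are accepted, not what is ultimately generated.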
