DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting

Abstract

Large language models (LLMs) perform exceptionally well across a wide range of tasks, but their token-by-token autoregressive generation significantly slows inference. Speculative decoding offers a promising draft-then-verify framework that reduces generation latency while preserving the fidelity of the output distribution. However, the draft model introduces additional computational overhead, which becomes a performance bottleneck and increases the time to first token (TTFT). Previous approaches to mitigating draft model overhead have relied primarily on heuristics and have generally failed to match the quality of draft language models. To address these challenges, we propose DuoDecoding, a novel approach that strategically deploys the draft model on the CPU and the target model on the GPU, enabling parallel decoding while preserving draft quality. Our method incorporates a hardware-aware optimal draft budget to minimize idle time and employs dynamic multi-sequence drafting to enhance draft quality. Extensive experiments across seven tasks show that DuoDecoding achieves up to a 2.61x speedup in generation latency while reducing TTFT to 83% of that of conventional speculative decoding. The code is available at https://github.com/KaiLv69/DuoDecoding.
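
To make the idea of a hardware-aware draft budget concrete, the following is a minimal Python sketch under stated assumptions, not the authors' implementation. It assumes the draft model (CPU) and the target model (GPU) run in parallel, and that idle time is smallest when the CPU produces about as many draft tokens as fit into one GPU verification pass. The function names `draft_budget` and `idle_time_ms`, the `max_budget` cap, and the latency figures are all hypothetical and chosen only for illustration.

```python
def draft_budget(target_step_ms: float, draft_step_ms: float, max_budget: int = 16) -> int:
    """Estimate how many draft tokens fit into one target-model pass.

    Assumption (not taken from the abstract): the budget is the floor of the
    ratio between the measured target latency and the per-token draft latency,
    clamped to the range [1, max_budget].
    """
    if target_step_ms <= 0 or draft_step_ms <= 0:
        raise ValueError("latencies must be positive")
    budget = int(target_step_ms // draft_step_ms)
    return max(1, min(budget, max_budget))


def idle_time_ms(target_step_ms: float, draft_step_ms: float, budget: int) -> float:
    """Time one device waits for the other in a single parallel round."""
    return abs(target_step_ms - budget * draft_step_ms)


if __name__ == "__main__":
    # Hypothetical measurements: 30 ms per target pass, 10 ms per draft token.
    budget = draft_budget(30.0, 10.0)
    print(budget, idle_time_ms(30.0, 10.0, budget))
```

In practice the two latencies would be profiled on the actual hardware before decoding starts; the cap simply keeps the budget bounded when the draft model is much faster than the target model.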
