Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Abstract

Speculative decoding (SD) accelerates large language model inference through a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which incur severe VRAM bandwidth and computational overheads that bottleneck end-to-end speedup. Dynamic-depth pruning reduces this latency by removing marginal branches, but it also discards potentially valid candidates, which keeps the acceptance rate below the upper bound set by dense trees. In this paper, we identify a critical opportunity in resource allocation: moving from dense to pruned drafting frees up a significant computational budget. To break this Pareto tradeoff between drafting cost and acceptance rate, we introduce Graft, a compensation framework that couples pruning and retrieval as mutually reinforcing operations. Pruning supplies sufficient budget for retrieval, while retrieval compensates for pruning-induced coverage loss and recovers accepted length. Using a sequential "prune-then-graft" mechanism, Graft attaches highly predictive retrieved tokens at the positions opened by pruning, filling the topological gaps with near-zero overhead. Graft is entirely training-free and lossless. Comprehensive evaluations show that Graft establishes a new Pareto frontier across practical deployment settings, including short-context generation, long-context generation, and large-scale models. On short-context benchmarks, it achieves up to a 5.41× speedup and improves the average speedup over EAGLE-3 by up to 21.8% on the large-scale Qwen3-235B. We also provide a preliminary exploration of applying Graft to the DFlash-style block drafting paradigm, offering initial evidence and insights for extending grafting beyond autoregressive draft trees.
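The prune-then-graft idea can be illustrated with a toy sketch. This is not the Graft implementation: the `Node` class, the score threshold, the n-gram lookup used for retrieval, and the rule that grafting may add at most as many nodes as pruning removed are all assumptions made for illustration. The abstract does not specify how branches are scored, how tokens are retrieved, or how the freed budget is accounted for.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    token: str
    score: float                      # hypothetical draft confidence
    children: list = field(default_factory=list)
    source: str = "draft"             # "draft" or "retrieved"


def count_nodes(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(count_nodes(c) for c in node.children)


def prune(node, threshold):
    """Remove, in place, every subtree whose root scores below `threshold`."""
    node.children = [c for c in node.children if c.score >= threshold]
    for child in node.children:
        prune(child, threshold)


def retrieve(context, key_len, max_len):
    """Assumed retrieval: find the most recent earlier occurrence of the
    last `key_len` context tokens and return the tokens that followed it."""
    key = context[-key_len:]
    for start in range(len(context) - key_len - 1, -1, -1):
        if context[start:start + key_len] == key:
            return context[start + key_len:start + key_len + max_len]
    return []


def graft(root, retrieved, budget):
    """Attach `retrieved` as a chain under `root`, reusing nodes that already
    match and creating at most `budget` new nodes. Returns nodes created."""
    node, added = root, 0
    for tok in retrieved:
        match = next((c for c in node.children if c.token == tok), None)
        if match is None:
            if added == budget:
                break
            match = Node(tok, score=0.0, source="retrieved")
            node.children.append(match)
            added += 1
        node = match
    return added


def prune_then_graft(root, context, threshold, key_len=2, max_len=4):
    """Prune first, then spend only the freed node budget on retrieved tokens."""
    dense_size = count_nodes(root)
    prune(root, threshold)
    freed = dense_size - count_nodes(root)
    return graft(root, retrieve(context, key_len, max_len), freed)
```

In this sketch the root stands for the last accepted token and `context` ends with that token. Because grafting never adds more nodes than pruning removed, the tree sent to verification is no larger than the original dense tree, which is one simple way to read the abstract's point that pruning supplies the budget for retrieval.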