SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

Abstract

Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in a single target forward pass. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3 preserve dependence along each draft path but call the drafter once per tree depth, so drafting accounts for a non-trivial share of per-iteration latency. Parallel drafters reduce drafter calls by predicting multiple future positions in one forward pass, but each position is predicted without seeing the others, which produces paths the verifier rejects. In this paper, we propose SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. Each drafter forward pass produces K dependent positions, which we call a block, and the draft tree grows through repeated block expansions. Two mechanisms explicitly carry path dependence so that later draft positions stay accurate. Within each block, a layer-wise shift carries the previous position's hidden state into every decoder layer. Across blocks, each new block can start from any position of the previous block, inheriting its hidden state to extend the path. To spend verifier budget where acceptance is likely, a co-trained rank head replaces the fixed top-k tree and allocates per-position branching during drafting. To avoid training the drafter on prefixes it never produces at inference, a valid-prefix mask drops the loss at later positions once an earlier position is wrong. Beyond static drafting, a cost-aware bandit at deployment uses free verifier feedback to update the drafter selectively, only when the expected throughput gain exceeds the update cost. Experiments show that SpecBlock improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost, and that cost-aware adaptation extends this lead to 11-19%.
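The valid-prefix mask described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `valid_prefix_mask` and `masked_mean_loss`, the use of token IDs for the correctness check, and the normalization by the number of kept positions are all assumptions made for illustration. The sketch keeps the loss at a position only while every earlier position in the block was predicted correctly, so the first wrong position still contributes but everything after it is dropped.

```python
from typing import List, Sequence


def valid_prefix_mask(predicted: Sequence[int], target: Sequence[int]) -> List[int]:
    """Return a 0/1 loss mask over the K positions of one block.

    Position i keeps its loss (mask 1) only if every earlier position
    j < i was predicted correctly. Once one position is wrong, all
    later positions are masked out (mask 0).
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    mask: List[int] = []
    prefix_ok = True  # the empty prefix is always valid
    for p, t in zip(predicted, target):
        mask.append(1 if prefix_ok else 0)
        prefix_ok = prefix_ok and (p == t)
    return mask


def masked_mean_loss(losses: Sequence[float], mask: Sequence[int]) -> float:
    """Average the per-position losses over kept positions only.

    Normalizing by the number of kept positions is an assumption of
    this sketch; the abstract does not state how the loss is reduced.
    """
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(losses, mask)) / kept


# Hypothetical block with K = 4: the second position is wrong.
mask = valid_prefix_mask([5, 7, 9, 2], [5, 8, 9, 2])
loss = masked_mean_loss([0.1, 2.0, 3.0, 4.0], mask)
```

In this example the mask is `[1, 1, 0, 0]`: the wrong second position is still trained on, while the third and fourth positions, which at inference would sit on a rejected path, contribute nothing.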
