Accelerating Speculative Decoding with Block Diffusion Draft Trees

Abstract

Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially limiting its acceptance length. We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model's output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, these gains place DDTree among the leading approaches to speculative decoding.
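To make the tree construction concrete, the following is a minimal sketch and not the authors' implementation. It assumes the surrogate score of a path is the product of the drafter's per-position probabilities, since a block diffusion drafter emits one distribution per draft position. The names build_draft_tree and ancestor_mask and the node tuple layout are hypothetical and chosen only for illustration.

```python
import heapq
import math


def build_draft_tree(probs, budget):
    """Best-first selection of up to `budget` tree nodes.

    probs[d] maps token -> drafter probability at draft position d.
    Assumption: a path's score is the product of its per-position
    probabilities. Returns nodes as (parent_index, depth, token, log_score),
    where parent_index is -1 for children of the root.
    """
    # Rank tokens at each position once, most probable first.
    ranked = [
        sorted(((t, q) for t, q in p.items() if q > 0), key=lambda kv: -kv[1])
        for p in probs
    ]
    nodes = []
    if budget <= 0 or not ranked or not ranked[0]:
        return nodes
    tie = 0  # tie-breaker so the heap never compares beyond the score
    # Heap entries: (negative log score, tie, parent index, depth, rank)
    heap = [(-math.log(ranked[0][0][1]), tie, -1, 0, 0)]
    while heap and len(nodes) < budget:
        neg, _, parent, depth, rank = heapq.heappop(heap)
        token, p = ranked[depth][rank]
        nodes.append((parent, depth, token, -neg))
        idx = len(nodes) - 1
        # Candidate 1: the best child of the node just accepted.
        if depth + 1 < len(ranked) and ranked[depth + 1]:
            tie += 1
            child_p = ranked[depth + 1][0][1]
            heapq.heappush(heap, (neg - math.log(child_p), tie, idx, depth + 1, 0))
        # Candidate 2: the next-ranked sibling under the same parent.
        if rank + 1 < len(ranked[depth]):
            tie += 1
            sib_p = ranked[depth][rank + 1][1]
            sib_neg = neg + math.log(p) - math.log(sib_p)
            heapq.heappush(heap, (sib_neg, tie, parent, depth, rank + 1))
    return nodes


def ancestor_mask(nodes):
    """mask[i][j] is True when node j is node i itself or one of its ancestors."""
    n = len(nodes)
    mask = [[False] * n for _ in range(n)]
    for i, (parent, _depth, _token, _score) in enumerate(nodes):
        mask[i][i] = True
        if parent >= 0:
            # Parents are always appended before their children,
            # so the parent's row is already complete.
            for j in range(n):
                if mask[parent][j]:
                    mask[i][j] = True
    return mask
```

Each accepted node contributes only two new candidates (its best child and its next sibling), so the heap stays small while nodes still come out in non-increasing score order. In an actual verification pass, every tree node would also attend to the already-accepted prefix; the mask above covers only the tree-to-tree part.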
