Accelerating Speculative Decoding with Block Diffusion Draft Trees

Abstract

Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially limiting its acceptance length. We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model's output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, these gains place DDTree among the leading approaches to speculative decoding.
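The tree construction and verification mask described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DDTree implementation: the surrogate is assumed to be the product of the drafter's per-position probabilities along a path, and the names `build_draft_tree` and `ancestor_mask` are hypothetical. The mask covers only the draft nodes; attention to the already accepted prefix is omitted.

```python
import heapq
import math


def build_draft_tree(probs, budget):
    """Select up to `budget` tree nodes by best-first search.

    probs[d][v] is the drafter's probability of token v at block position d.
    Assumption: a path is scored by the product of its per-position
    probabilities (sum of log-probabilities).
    """
    # Rank the tokens at each position once, most probable first.
    ranked = [
        sorted(((p, v) for v, p in enumerate(row) if p > 0), reverse=True)
        for row in probs
    ]
    nodes = []    # selected nodes, parents always precede children
    heap = []     # entries: (-logp, tie, parent, depth, rank, parent_logp)
    counter = 0   # tie-breaker so heap entries always compare cleanly

    def push(parent, depth, rank, parent_logp):
        nonlocal counter
        if depth < len(ranked) and rank < len(ranked[depth]):
            p = ranked[depth][rank][0]
            logp = parent_logp + math.log(p)
            heapq.heappush(heap, (-logp, counter, parent, depth, rank, parent_logp))
            counter += 1

    push(-1, 0, 0, 0.0)  # best token at the first position; -1 marks the root
    while heap and len(nodes) < budget:
        neg_logp, _, parent, depth, rank, parent_logp = heapq.heappop(heap)
        nodes.append({
            "token": ranked[depth][rank][1],
            "parent": parent,
            "depth": depth,
            "logp": -neg_logp,
        })
        index = len(nodes) - 1
        push(parent, depth, rank + 1, parent_logp)  # next-best sibling
        push(index, depth + 1, 0, -neg_logp)        # best child of this node
    return nodes


def ancestor_mask(nodes):
    """mask[i][j] is True iff node j is node i itself or one of its ancestors."""
    n = len(nodes)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = nodes[j]["parent"]
    return mask
```

Each candidate is generated exactly once, either as the next-best sibling of a popped node or as its best child, and neither can score higher than the node that generated it. Under the assumed factorized surrogate, the heap therefore returns nodes in non-increasing score order, and stopping at the budget yields the highest-scoring prefixes. A call such as `build_draft_tree(probs, budget=4)` followed by `ancestor_mask(nodes)` gives the node list and the boolean mask that a single verification pass would use.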