Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

May 19, 2026
Authors: Yuhao Shen, Tianyu Liu, Xinyi Hu, Quan Kong, Baolin Zhang, Jun Dai, Jun Zhang, Shuang Ge, Lei Chen, Yue Li, Mingcheng Wan, Cong Wang
cs.AI

Abstract

Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottleneck end-to-end speedups. While dynamic-depth pruning can reduce this latency by removing marginal branches, it also discards potentially valid candidates, preventing the acceptance rate from reaching the upper bound of dense trees. In this paper, we identify a critical opportunity in resource allocation: the transition from dense to pruned drafting frees up significant computational budget. To break this Pareto tradeoff, we introduce Graft, a compensation framework that couples pruning and retrieval as mutually reinforcing operations. Pruning supplies sufficient budget for retrieval, while retrieval compensates for pruning-induced coverage loss and recovers the accepted length. By employing a sequential "prune-then-graft" mechanism, Graft attaches highly predictive retrieved tokens at positions opened by pruning, filling the topological gaps with near-zero overhead. Graft is entirely training-free and lossless. Comprehensive evaluations show that Graft establishes a new Pareto frontier across practical deployment settings, including short-context generation, long-context generation, and large-scale models. On short-context benchmarks, it achieves up to 5.41× speedup and improves average speedup over EAGLE-3 by up to 21.8% on the large-scale Qwen3-235B model. We also provide a preliminary exploration of applying Graft to the DFlash-style block drafting paradigm, offering initial evidence and insights for extending grafting beyond autoregressive draft trees.
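
To make the "prune-then-graft" idea concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a draft tree whose nodes carry a hypothetical confidence `score`, a fixed pruning `threshold`, and a simple n-gram lookup over the existing context as the retrieval source. The names `Node`, `prune`, `retrieve`, `graft`, and `prune_then_graft` are all illustrative; the abstract does not specify how Graft scores, prunes, or retrieves.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    token: str
    score: float  # hypothetical draft confidence (assumption, not from the paper)
    children: list = field(default_factory=list)
    grafted: bool = False  # True if the node came from retrieval, not drafting


def tree_size(node):
    """Count the nodes in the subtree rooted at node."""
    return 1 + sum(tree_size(c) for c in node.children)


def prune(node, threshold):
    """Drop every subtree whose root scores below threshold.

    Returns the number of nodes removed, i.e. the freed budget.
    """
    kept, removed = [], 0
    for child in node.children:
        if child.score < threshold:
            removed += tree_size(child)
        else:
            removed += prune(child, threshold)
            kept.append(child)
    node.children = kept
    return removed


def retrieve(context, suffix, max_len):
    """Toy retrieval: find the most recent earlier occurrence of suffix
    in context and return up to max_len tokens that followed it."""
    n = len(suffix)
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == suffix:
            return context[start + n:start + n + max_len]
    return []


def graft(root, chain, budget):
    """Attach chain under root, reusing nodes that already match,
    and create at most budget new nodes. Returns the number created."""
    node, added = root, 0
    for tok in chain:
        match = next((c for c in node.children if c.token == tok), None)
        if match is None:
            if added == budget:
                break
            match = Node(tok, score=0.0, grafted=True)
            node.children.append(match)
            added += 1
        node = match
    return added


def prune_then_graft(root, context, threshold, ngram=2):
    """Prune first, then spend at most the freed budget on retrieved tokens."""
    freed = prune(root, threshold)
    chain = retrieve(context, context[-ngram:], max_len=freed)
    return freed, graft(root, chain, freed)


# Usage: the root stands for the last accepted token of the context.
context = "the cat sat on the mat and the cat".split()
root = Node("cat", 1.0, [
    Node("sat", 0.6, [Node("on", 0.5), Node("in", 0.05)]),
    Node("ran", 0.1, [Node("away", 0.08)]),
])
freed, added = prune_then_graft(root, context, threshold=0.2)
print(freed, added, tree_size(root))
```

In this toy run, pruning removes three low-score nodes, and the retrieved continuation extends the surviving `sat -> on` path by one grafted token. Grafted nodes are only extra candidates: as in any speculative decoding scheme, the target model's verification step still decides which tokens are accepted, which is what keeps the output lossless.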