Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
May 19, 2026
Authors: Yuhao Shen, Tianyu Liu, Xinyi Hu, Quan Kong, Baolin Zhang, Jun Dai, Jun Zhang, Shuang Ge, Lei Chen, Yue Li, Mingcheng Wan, Cong Wang
cs.AI
Abstract
Speculative decoding (SD) accelerates large language model inference through a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottleneck end-to-end speedups. While dynamic-depth pruning can reduce this latency by removing marginal branches, it also discards potentially valid candidates, preventing the acceptance rate from reaching the upper bound of dense trees. In this paper, we identify a critical opportunity in resource allocation: the transition from dense to pruned drafting frees up significant computational budget.
To break this Pareto tradeoff, we introduce Graft, a compensation framework that couples pruning and retrieval as mutually reinforcing operations. Pruning supplies sufficient budget for retrieval, while retrieval compensates for pruning-induced coverage loss and recovers accepted length. By employing a sequential "prune-then-graft" mechanism, Graft attaches highly predictive retrieved tokens to the positions opened by pruning, filling the topological gaps with near-zero overhead. Graft is entirely training-free and lossless.
Comprehensive evaluations show that Graft establishes a new Pareto frontier across practical deployment settings, including short-context generation, long-context generation, and large-scale models. On short-context benchmarks, it achieves up to 5.41× speedup and improves average speedup over EAGLE-3 by up to 21.8% on the large-scale Qwen3-235B. We also provide a preliminary exploration of applying Graft to the DFlash-style block drafting paradigm, offering initial evidence and insights for extending grafting beyond autoregressive draft trees.
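The prune-then-graft idea can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how branches are scored, where retrieved tokens come from, or how they are attached. The sketch assumes a fixed score threshold for pruning and a simple "most recent earlier occurrence in the context" lookup for retrieval. The names `Node`, `prune`, `retrieve`, and `graft` are hypothetical, and the verification step of speculative decoding is omitted.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    token: str
    score: float                       # assumed draft-model confidence (hypothetical)
    children: list[Node] = field(default_factory=list)
    source: str = "draft"              # "draft" or "retrieved"


def count_nodes(node: Node) -> int:
    return 1 + sum(count_nodes(c) for c in node.children)


def prune(node: Node, threshold: float, path=(), opened=None):
    """Remove child subtrees scoring below `threshold`.

    Returns (freed, opened): the number of nodes removed, and a list of
    (parent, path_to_parent) pairs marking positions opened by pruning.
    """
    if opened is None:
        opened = []
    path = path + (node.token,)
    freed, kept = 0, []
    for child in node.children:
        if child.score < threshold:
            freed += count_nodes(child)
        else:
            sub_freed, _ = prune(child, threshold, path, opened)
            freed += sub_freed
            kept.append(child)
    if len(kept) < len(node.children):
        opened.append((node, path))
    node.children = kept
    return freed, opened


def retrieve(context: list[str], path: tuple, max_len: int) -> list[str]:
    """Toy retrieval (an assumption): tokens that followed the most recent
    earlier occurrence of the path's last token in the context."""
    last = path[-1]
    for i in range(len(context) - 2, -1, -1):
        if context[i] == last:
            return context[i + 1 : i + 1 + max_len]
    return []


def graft(opened, context: list[str], budget: int) -> int:
    """Attach retrieved chains at opened positions, spending at most `budget` nodes."""
    used = 0
    for parent, path in opened:
        if used >= budget:
            break
        tokens = retrieve(context, path, budget - used)
        existing = {c.token for c in parent.children}
        if not tokens or tokens[0] in existing:
            continue  # nothing retrieved, or it duplicates a surviving draft branch
        node = parent
        for tok in tokens:
            child = Node(tok, score=0.0, source="retrieved")
            node.children.append(child)
            node = child
            used += 1
    return used


# The root holds the last accepted token, which is also context[-1].
context = ["the", "cat", "sat", "on", "the"]
root = Node("the", 1.0, [
    Node("mat", 0.6, [Node("and", 0.1)]),
    Node("dog", 0.05, [Node("ran", 0.5)]),
])

freed, opened = prune(root, threshold=0.2)
used = graft(opened, context, budget=freed)
```

In this toy run the node budget freed by pruning is spent entirely on a retrieved chain, so the tree going to verification has the same size as the dense one but a different shape. A real system would also need to bound retrieval cost and build the tree attention mask for the grafted nodes, neither of which is shown here.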