SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

Abstract

Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in a single forward pass of the target model. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3 preserve dependence along each draft path but call the drafter once per tree depth, which makes drafting a non-trivial share of per-iteration latency. Parallel drafters reduce drafter calls by predicting multiple future positions in one forward pass, but each position is predicted without seeing the others, so they produce paths that the verifier rejects. We propose SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. Each drafter forward pass produces K dependent positions, which we call a block, and the draft tree grows through repeated block expansions. Two mechanisms carry path dependence explicitly so that later draft positions stay accurate. Within a block, a layer-wise shift carries the previous position's hidden state into every decoder layer. Across blocks, each new block can start from any position of the previous block and inherits that position's hidden state to extend the path. To spend verifier budget where acceptance is likely, a co-trained rank head replaces the fixed top-k tree and allocates branching per position during drafting. To avoid training the drafter on prefixes it never produces at inference, a valid-prefix mask drops the loss at later positions once an earlier position is wrong. Beyond static drafting, a cost-aware bandit at deployment uses free verifier feedback to update the drafter selectively, only when the expected throughput gain exceeds the update cost. Experiments show that SpecBlock improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost, and that cost-aware adaptation extends this lead to 11-19%.
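The valid-prefix mask can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of an exact top-1 match to decide whether a position is "wrong", and the choice to average over the kept positions are all assumptions made for illustration. The abstract only states that the loss at later positions is dropped once an earlier one is wrong.

```python
from typing import List


def valid_prefix_mask(correct: List[bool]) -> List[float]:
    """Per-position loss weights for one block of K draft positions.

    correct[k] is True when the drafter's prediction at position k matches
    the training target (assumption: exact top-1 match). Position k keeps
    its loss only if every earlier position j < k was correct.
    """
    mask = []
    prefix_ok = True  # position 0 has an empty prefix, which is always valid
    for is_correct in correct:
        mask.append(1.0 if prefix_ok else 0.0)
        prefix_ok = prefix_ok and is_correct
    return mask


def masked_block_loss(per_position_loss: List[float], correct: List[bool]) -> float:
    """Average the loss over positions whose prefix is valid.

    Normalizing by the number of kept positions is an assumption of this
    sketch; the source does not specify the normalization.
    """
    mask = valid_prefix_mask(correct)
    kept = sum(mask)
    if kept == 0:  # only possible for an empty block
        return 0.0
    total = sum(loss * m for loss, m in zip(per_position_loss, mask))
    return total / kept


# The first error is at position 1. Its own loss is still kept because its
# prefix is valid, while positions 2 and 3 are dropped.
mask = valid_prefix_mask([True, False, True, True])
```

In this sketch the first wrong position still contributes to the loss, because the drafter does reach it at inference. Only the positions after it are conditioned on a prefix the drafter would not have produced.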

