SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

Abstract

Speculative decoding accelerates large language model (LLM) inference by drafting a tree of candidate continuations and verifying it in a single forward pass of the target model. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3 preserve dependence along each draft path but call the drafter once per level of tree depth, so drafting takes up a non-trivial share of per-iteration latency. Parallel drafters reduce the number of drafter calls by predicting multiple future positions in one forward pass, but each position is predicted without seeing the others, which produces paths that the verifier rejects. In this paper, we propose SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. Each drafter forward pass produces K dependent positions, which we call a block, and the draft tree grows through repeated block expansions. Two mechanisms explicitly carry path dependence to keep later draft positions accurate. Within each block, a layer-wise shift carries the previous position's hidden state into every decoder layer. Across blocks, each new block can start from any position of the previous block, inheriting that position's hidden state to extend the path. To spend the verifier budget where acceptance is likely, a co-trained rank head replaces the fixed top-k tree and allocates branching per position during drafting. To avoid training the drafter on prefixes it never produces at inference, a valid-prefix mask drops the loss at later positions once an earlier position is wrong. Beyond static drafting, a cost-aware bandit at deployment uses free verifier feedback to update the drafter selectively, only when the expected throughput gain exceeds the update cost. Experiments show that SpecBlock improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost, and that cost-aware adaptation extends this lead to 11-19%.
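The valid-prefix mask described above can be illustrated with a minimal sketch. This is one plausible reading of the abstract, not the authors' implementation: it assumes per-position losses within a single block of K positions, and it assumes that a position's loss is kept only when every earlier position in the block was predicted correctly. The function names valid_prefix_mask and masked_mean_loss are hypothetical.

```python
from typing import List


def valid_prefix_mask(correct: List[bool]) -> List[int]:
    """Return 1 for positions whose loss is kept and 0 for dropped positions.

    Assumption (not confirmed by the source): position k is trained only if
    all earlier positions 0..k-1 in the block were predicted correctly.
    """
    mask = []
    prefix_ok = True  # stays True while every earlier position was correct
    for is_correct in correct:
        mask.append(1 if prefix_ok else 0)
        prefix_ok = prefix_ok and is_correct
    return mask


def masked_mean_loss(losses: List[float], correct: List[bool]) -> float:
    """Average the per-position losses over the kept positions only."""
    mask = valid_prefix_mask(correct)
    kept = sum(mask)
    if kept == 0:  # only happens for an empty block
        return 0.0
    return sum(loss * m for loss, m in zip(losses, mask)) / kept


# Hypothetical block with K = 4: the third position is wrong, so the fourth
# position would never be reached along this path and its loss is dropped.
print(valid_prefix_mask([True, True, False, True]))
```

Under this reading, the first wrong position still contributes to the loss because its own prefix is valid; only the positions after it are masked out.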
