Batch Speculative Decoding Done Right
October 26, 2025
Authors: Ranran Haoran Zhang, Soumik Dey, Ashirbad Mishra, Hansi Wu, Binbin Li, Rui Zhang
cs.AI
Abstract
Speculative decoding speeds up LLM inference by using a small draft model to
propose multiple tokens that a target model verifies in parallel. Extending
this idea to batches is essential for production serving, but it introduces the
ragged tensor problem: sequences in the same batch accept different numbers of
draft tokens, breaking right-alignment and corrupting position IDs, attention
masks, and KV-cache state. We show that several existing batch implementations
violate output equivalence: the fundamental requirement that speculative
decoding produce exactly the same token sequence as standard autoregressive
generation. These violations stem directly from improper handling of the
ragged tensor problem. In response, we (1) characterize the synchronization
requirements that guarantee correctness, (2) present EQSPEC, a
correctness-first batch speculative decoding scheme that exposes realignment
as consuming 40% of total overhead, and (3) introduce EXSPEC, which maintains
a sliding pool of sequences and dynamically forms same-length groups,
reducing realignment overhead
while preserving per-sequence speculative speedups. On the SpecBench dataset,
across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B target/draft pairs, our
approach achieves up to a 3× throughput improvement at batch size 8 compared
to batch size 1, scaling efficiently through batch size 8, while
maintaining 95% output equivalence. Our method requires no custom kernels and
integrates cleanly with existing inference stacks. Our code is available at
https://github.com/eBay/spec_dec.
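
Output equivalence is easiest to see in the greedy case: the target model
verifies all k draft tokens in one forward pass and accepts the longest
agreeing prefix, so the emitted tokens match plain autoregressive decoding
exactly. Below is a minimal sketch of that verification step under greedy
decoding; the function name and tensor layout are assumptions for exposition,
not the paper's EQSPEC code.

```python
# Minimal sketch of greedy draft-token verification (illustrative; not the
# paper's EQSPEC code). Tensor shapes are assumptions for exposition.
import torch

def verify_greedy(draft_tokens: torch.Tensor, target_logits: torch.Tensor):
    """draft_tokens:  (batch, k) tokens proposed by the draft model.
    target_logits: (batch, k+1, vocab) from one parallel target forward pass.
    Returns (accepted, correction): how many draft tokens each sequence
    accepts, plus the target model's token after the accepted prefix.
    """
    target_choice = target_logits.argmax(dim=-1)      # (batch, k+1)
    match = draft_tokens == target_choice[:, :-1]     # (batch, k)
    # Longest agreeing prefix per row: cumprod zeroes out everything after
    # the first mismatch, so the sum counts accepted tokens.
    accepted = match.to(torch.long).cumprod(dim=1).sum(dim=1)  # (batch,)
    # The next token always comes from the target model, so the output is
    # token-for-token identical to standard greedy autoregressive decoding.
    correction = target_choice.gather(1, accepted.unsqueeze(1)).squeeze(1)
    return accepted, correction
```

Because `accepted` generally differs across rows, the batch becomes ragged
immediately after this step.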
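
That per-row variation is the ragged tensor problem: after verification,
sequences have different lengths, so the batch must be re-padded and its
position IDs and attention mask rebuilt before the next step. A hedged sketch
of that re-padding (realignment) step follows; all names here are invented for
illustration, and a real system must also keep KV-cache entries consistent,
which this sketch omits.

```python
# Illustrative realignment after a verification step (not the paper's code).
# Each sequence has been truncated to its accepted prefix plus correction
# token, so lengths now differ across the batch.
import torch

def realign_batch(seqs: list[torch.Tensor], pad_id: int):
    """Left-pad variable-length sequences back into a rectangular batch and
    rebuild position IDs and the attention mask to match."""
    max_len = max(s.size(0) for s in seqs)
    input_ids = torch.full((len(seqs), max_len), pad_id, dtype=torch.long)
    attn_mask = torch.zeros(len(seqs), max_len, dtype=torch.long)
    position_ids = torch.zeros(len(seqs), max_len, dtype=torch.long)
    for i, s in enumerate(seqs):
        pad = max_len - s.size(0)  # left padding preserves right-alignment
        input_ids[i, pad:] = s
        attn_mask[i, pad:] = 1
        position_ids[i, pad:] = torch.arange(s.size(0))
    return input_ids, position_ids, attn_mask
```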
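
The abstract's description of EXSPEC, a sliding pool from which same-length
groups are formed dynamically, suggests a scheduling loop like the sketch
below. This is one plausible reading of the abstract, not the released
implementation; `max_batch` and the bucketing policy are assumptions.

```python
# Hedged sketch of same-length grouping over a sliding sequence pool, as the
# abstract describes EXSPEC. Bucketing policy and names are assumptions.
from collections import defaultdict

def form_same_length_groups(pool: dict[int, int],
                            max_batch: int) -> list[list[int]]:
    """pool maps seq_id -> current length for in-flight sequences.
    Returns batches of seq_ids that share a length (so no realignment is
    needed within a group), largest buckets first, capped at max_batch."""
    buckets = defaultdict(list)
    for seq_id, length in pool.items():
        buckets[length].append(seq_id)
    groups = []
    for _, ids in sorted(buckets.items(), key=lambda kv: -len(kv[1])):
        for i in range(0, len(ids), max_batch):
            groups.append(ids[i:i + max_batch])
    return groups
```

Under this reading, sequences whose lengths diverge simply wait in the pool
until a matching group forms, which is how realignment cost could be amortized
while per-sequence speculative speedups are preserved.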