
Batch Speculative Decoding Done Right

October 26, 2025
Authors: Ranran Haoran Zhang, Soumik Dey, Ashirbad Mishra, Hansi Wu, Binbin Li, Rui Zhang
cs.AI

Abstract

Speculative decoding speeds up LLM inference by using a small draft model to propose multiple tokens that a target model verifies in parallel. Extending this idea to batches is essential for production serving, but it introduces the ragged tensor problem: sequences in the same batch accept different numbers of draft tokens, breaking right-alignment and corrupting position IDs, attention masks, and KV-cache state. We show that several existing batch implementations violate output equivalence: the fundamental requirement that speculative decoding must produce token sequences identical to standard autoregressive generation. These violations stem precisely from improper handling of the ragged tensor problem. In response, we (1) characterize the synchronization requirements that guarantee correctness, (2) present a correctness-first batch speculative decoding algorithm, EQSPEC, which exposes realignment as consuming 40% of the overhead, and (3) introduce EXSPEC, which maintains a sliding pool of sequences and dynamically forms same-length groups, reducing the realignment overhead while preserving per-sequence speculative speedups. On the SpecBench dataset, across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B target/draft pairs, our approach achieves up to 3× higher throughput at batch size 8 than at batch size 1, scaling efficiently through batch size 8, while maintaining 95% output equivalence. Our method requires no custom kernels and integrates cleanly with existing inference stacks. Our code is available at https://github.com/eBay/spec_dec.
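To make the EXSPEC idea concrete, below is a minimal illustrative sketch, not the authors' implementation, of a sliding pool that batches sequences by current length. The class name `SlidingPool` and method names are hypothetical; it assumes sequences are lists of token IDs whose lengths diverge after each speculative step, and that grouping equal-length sequences restores the rectangular tensors that position IDs, attention masks, and KV caches expect.

```python
# Hypothetical sketch of EXSPEC-style same-length grouping (assumed API,
# not the spec_dec code). Sequences of equal length can be stacked into
# a rectangular batch, avoiding ragged-tensor realignment.
from collections import defaultdict


class SlidingPool:
    """Holds in-flight sequences and hands out same-length batches."""

    def __init__(self, max_batch_size: int = 8):
        self.max_batch_size = max_batch_size
        self.by_length = defaultdict(list)  # length -> waiting sequences

    def add(self, seq: list[int]) -> None:
        # Re-file a sequence under its current length; after a speculative
        # step, accepted draft-token counts differ, so lengths diverge.
        self.by_length[len(seq)].append(seq)

    def next_batch(self) -> list[list[int]]:
        # Pick the length with the most waiting sequences so batches stay full.
        if not self.by_length:
            return []
        length = max(self.by_length, key=lambda n: len(self.by_length[n]))
        group = self.by_length[length][: self.max_batch_size]
        self.by_length[length] = self.by_length[length][self.max_batch_size :]
        if not self.by_length[length]:
            del self.by_length[length]
        return group


# Usage: pool sequences, run one speculative step per same-length group,
# then return each sequence to the pool under its new length.
pool = SlidingPool(max_batch_size=8)
for seq in [[1, 2, 3], [4, 5, 6], [7, 8]]:
    pool.add(seq)
batch = pool.next_batch()  # -> the two length-3 sequences
```

In this framing, each draw from the pool yields a batch that needs no padding or realignment; sequences that accepted different numbers of draft tokens simply migrate to different length buckets for the next step.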