Speculative Pipeline Decoding: Higher Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

Abstract

Speculative Decoding (SD) accelerates low-concurrency large language model (LLM) inference by employing a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which introduces escalating prediction difficulty and serial drafting latency. To address these issues, we propose Speculative Pipeline Decoding (SPD), a framework that unlocks the potential of pipeline parallelism. By partitioning the target LLM into n pipeline stages, SPD allows the LLM to process n tokens in parallel, which accelerates decoding. To keep the pipeline continuously filled during single-sequence decoding, a speculation module aggregates intermediate features from different pipeline depths to predict the next token. The module executes strictly in parallel with the target model's pipeline step, which yields bounded prediction difficulty, higher acceptance rates, and zero latency bubbles. Our experiments show that SPD achieves a significantly higher theoretical speedup than mainstream baselines, offering a highly scalable solution for LLM decoding acceleration. Our code is available at https://github.com/yuyijiong/speculative_pipeline_decoding
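The pipeline schedule described above can be illustrated with a toy step-count simulation. This is only a sketch under stated assumptions, not the released implementation: every stage is assumed to cost one time step, the speculation module is assumed to add no latency, and a rejected draft is assumed to flush every token behind it in the pipeline. The function `simulate_spd`, the `"<ctx>"` placeholder, and the position-indexed `draft` list are hypothetical names introduced for illustration. The speculation module's feature aggregation is not modeled; its guesses are simply supplied as input.

```python
from collections import deque


def simulate_spd(n_stages, target, draft):
    """Toy schedule model (assumption: one time step per pipeline stage).

    target[i] is the token the target model emits at position i.
    draft[i] is the speculation module's guess for position i.
    Returns (pipeline_steps, verified_tokens).
    """
    verified = []          # tokens confirmed by the target model
    in_flight = deque()    # [fed_token, stages_left], oldest entry first
    next_input = "<ctx>"   # last context token, assumed correct
    steps = 0

    while len(verified) < len(target):
        # Feed one token into stage 0, then advance every stage by one step.
        in_flight.append([next_input, n_stages])
        steps += 1
        for entry in in_flight:
            entry[1] -= 1

        if in_flight[0][1] == 0:
            # The oldest token left the last stage: the target model's
            # output for the next position is now known.
            in_flight.popleft()
            true_token = target[len(verified)]
            verified.append(true_token)
            # The token right behind it was the draft for this position.
            if in_flight and in_flight[0][0] != true_token:
                in_flight.clear()  # mis-speculation: flush the pipeline

        if not in_flight:
            # Pipeline is empty, so restart from the verified token.
            next_input = verified[-1]
        else:
            # Otherwise feed the draft for the next unfed position.
            pos = len(verified) - 1 + len(in_flight)
            next_input = draft[pos] if pos < len(draft) else None

    return steps, verified


if __name__ == "__main__":
    tokens = list("abcd")
    print(simulate_spd(3, tokens, list("abcd")))  # every draft accepted
    print(simulate_spd(3, tokens, list("aXcd")))  # one draft rejected
    print(simulate_spd(3, tokens, list("xxxx")))  # every draft rejected
```

In this toy model with three stages and four tokens, a run where every draft is accepted finishes in 6 steps, one rejected draft raises that to 8, and rejecting every draft takes 12 steps, the same as decoding with no speculation at all. The verified output equals `target` in every case, because drafts only decide what is fed into the pipeline early, never what is emitted.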