Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

Abstract

Speculative Decoding (SD) accelerates low-concurrency LLM inference through a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which makes each successive draft token harder to predict and incurs serial drafting latency. To address these issues, we propose Speculative Pipeline Decoding (SPD), a groundbreaking framework that unlocks the full potential of pipeline parallelism. By partitioning the target LLM into n pipeline stages, SPD allows the LLM to process n tokens in parallel to accelerate decoding. To keep the pipeline continuously filled during single-sequence decoding, we design a speculation module that aggregates intermediate features from different pipeline depths to predict the next token and executes strictly in parallel with the target model's pipeline step. This yields bounded prediction difficulty, higher acceptance rates, and zero latency bubbles. Our experiments demonstrate that SPD achieves a significantly higher theoretical speedup than mainstream baselines, offering a highly scalable solution for LLM decoding acceleration. Our code is available at https://github.com/yuyijiong/speculative_pipeline_decoding.
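
To make the scheduling idea concrete, the following toy simulation counts pipeline ticks for single-sequence decoding with n stages. It is only a sketch under stated assumptions and is not taken from the SPD codebase: the function name `simulate_pipeline_decoding`, the `is_correct` oracle standing in for the speculation module, and the flush-on-mismatch policy are all hypothetical choices made for illustration. One tick is the time for every stage to process one token, and the speculator is assumed to run within the same tick.

```python
from collections import deque
from typing import Callable


def simulate_pipeline_decoding(
    n_stages: int,
    num_tokens: int,
    is_correct: Callable[[int], bool],
) -> int:
    """Return the number of pipeline ticks needed to commit num_tokens.

    is_correct(pos) is a hypothetical oracle: True if the token speculated
    for position pos matches what the target model actually produces.
    """
    # Index 0 is the first stage, index -1 the last; None marks an empty slot.
    pipeline = deque([None] * n_stages)
    next_pos = 0   # position of the next token to feed into stage 0
    committed = 0  # verified target outputs produced so far
    ticks = 0
    while committed < num_tokens:
        ticks += 1
        # Advance every in-flight token by one stage and feed a new one.
        pipeline.pop()
        pipeline.appendleft(next_pos)
        next_pos += 1
        finished = pipeline[-1]  # token that completed the last stage this tick
        if finished is None:
            continue  # pipeline is still filling
        # The target output for `finished` fixes the true token at finished + 1.
        committed += 1
        speculated_in_flight = next_pos > finished + 1
        if speculated_in_flight and not is_correct(finished + 1):
            # Assumed policy: discard all in-flight speculation and restart
            # from the corrected token at position finished + 1.
            pipeline = deque([None] * n_stages)
            next_pos = finished + 1
    return ticks


if __name__ == "__main__":
    n, t = 4, 8
    print("all accepted:", simulate_pipeline_decoding(n, t, lambda pos: True))
    print("all rejected:", simulate_pipeline_decoding(n, t, lambda pos: False))
    print("sequential baseline:", n * t)
```

In this toy model, a run in which every speculated token is accepted takes n + T - 1 ticks for T tokens (the pipeline fills once and then emits one token per tick), while a run in which every speculation is rejected degrades to n × T ticks, the same as pushing one token at a time through all n stages. Real speedups depend on the acceptance rate and on per-stage overheads that this sketch ignores.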