Speculative Pipeline Decoding: High-Accuracy, Zero-Bubble Speculation via Pipeline Parallelism

Abstract

Speculative Decoding (SD) accelerates low-concurrency LLM inference by employing a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which introduces escalating prediction difficulty and serial drafting latency. To address these issues, we propose Speculative Pipeline Decoding (SPD), a groundbreaking framework that unlocks the true potential of pipeline parallelism. By partitioning the target LLM into n pipeline stages, SPD allows the LLM to process n tokens in parallel and thereby accelerates decoding. To keep the pipeline continuously filled during single-sequence decoding, a speculation module aggregates intermediate features from different pipeline depths to predict the next token. The module executes strictly in parallel with the target model's pipeline step, which yields bounded prediction difficulty, higher acceptance rates, and zero latency bubbles. Our experiments demonstrate that SPD achieves a significantly higher theoretical speedup than mainstream baselines, offering a highly scalable solution for LLM decoding acceleration. Our code is available at https://github.com/yuyijiong/speculative_pipeline_decoding
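
The pipeline-filling idea described above can be illustrated with a toy timing model. The sketch below is not the authors' implementation, and it makes several assumptions that the abstract does not state: every pipeline stage takes exactly one tick, the speculation module finishes within the same tick as the pipeline step, each speculated token is correct with an independent probability `accept_prob` (given that all earlier tokens were correct), and a wrong guess flushes the pipeline and restarts it from the verified token. The names `spd_ticks` and `toy_speedup` are hypothetical.

```python
import random
from collections import deque


def spd_ticks(num_stages, num_tokens, accept_prob, seed=0):
    """Toy model: ticks needed to commit `num_tokens` verified tokens.

    Assumptions (ours, for illustration only): one tick per stage, a
    speculator that runs fully in parallel with each pipeline step, and
    independent per-token acceptance with probability `accept_prob`.
    """
    rng = random.Random(seed)
    # Each entry is True if that in-flight input token matches what the
    # target model would emit. The front entry is always a verified token.
    pipe = deque([True])
    done = 0        # stages completed by the front entry
    ticks = 0
    committed = 0
    while committed < num_tokens:
        ticks += 1
        done += 1   # every in-flight token advances one stage
        # In parallel, the speculator guesses the next input token, which
        # will enter stage 0 on the following tick.
        pipe.append(rng.random() < accept_prob)
        if done == num_stages:
            pipe.popleft()          # front token leaves the last stage
            committed += 1          # its output is a verified token
            if pipe[0]:
                # The guess behind it was right; it entered one tick later,
                # so it has completed one stage fewer.
                done = num_stages - 1
            else:
                # Wrong guess: flush and restart from the verified token.
                pipe.clear()
                pipe.append(True)
                done = 0
    return ticks


def toy_speedup(num_stages, num_tokens, accept_prob, seed=0):
    """Ratio of plain autoregressive ticks to toy SPD ticks."""
    baseline = num_stages * num_tokens  # one token occupies the whole pipeline
    return baseline / spd_ticks(num_stages, num_tokens, accept_prob, seed)


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.8, 1.0):
        print(f"accept_prob={p:.1f}  toy speedup={toy_speedup(4, 1000, p):.2f}")
```

In this toy model, perfect acceptance needs `num_stages + num_tokens - 1` ticks, so the speedup approaches the number of stages n for long outputs, while zero acceptance falls back to the sequential cost of `num_stages * num_tokens` ticks. Real speedups depend on stage balance, communication cost, and the actual acceptance behaviour, none of which are modelled here.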