Speculative Pipeline Decoding: Higher Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

Abstract

Speculative Decoding (SD) accelerates low-concurrency LLM inference through a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which makes prediction progressively harder as the draft grows longer and adds serial drafting latency. To address these issues, we propose Speculative Pipeline Decoding (SPD), a framework that applies pipeline parallelism to single-sequence decoding. SPD partitions the target LLM into n pipeline stages, allowing the LLM to process n tokens in parallel and thereby accelerating decoding. To keep the pipeline continuously filled during single-sequence decoding, a speculation module aggregates intermediate features from different pipeline depths to predict the next token. This module executes strictly in parallel with the target model's pipeline step, which yields bounded prediction difficulty, higher acceptance rates, and zero latency bubbles. Our experiments show that SPD achieves a significantly higher theoretical speedup than mainstream baselines, offering a highly scalable solution for LLM decoding acceleration. Our code is available at https://github.com/yuyijiong/speculative_pipeline_decoding
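
To give a rough feel for why keeping an n-stage pipeline full can speed up single-sequence decoding, the following is a minimal toy timing model, not the method or the speedup formula from the paper. It assumes every pipeline stage costs one time unit, the speculation module adds no latency, an accepted speculation lets the next token commit one time unit later, and a rejected speculation flushes the pipeline so that the next token again costs n time units. The function names (spd_steps, baseline_steps, expected_speedup) and the independent acceptance rate are hypothetical and introduced only for illustration.

```python
def spd_steps(n_stages, accepted):
    """Time units to commit len(accepted) + 1 tokens in the toy model.

    accepted[i] is True when the token speculated after the i-th committed
    token matched the target model's output (an assumption of this sketch).
    """
    if n_stages < 1:
        raise ValueError("n_stages must be at least 1")
    # The first token has to traverse every stage before anything is committed.
    steps = n_stages
    for ok in accepted:
        # Accepted: the speculated token is already one stage behind, so the
        # next commit arrives one time unit later.
        # Rejected: in-flight tokens are discarded and the corrected token
        # must traverse all stages again.
        steps += 1 if ok else n_stages
    return steps


def baseline_steps(n_stages, num_tokens):
    """Time units for plain autoregressive decoding: one full pass per token."""
    return n_stages * num_tokens


def expected_speedup(n_stages, accept_rate):
    """Long-run speedup of the toy model for an independent acceptance rate."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must lie in [0, 1]")
    per_token = accept_rate * 1 + (1.0 - accept_rate) * n_stages
    return n_stages / per_token


if __name__ == "__main__":
    pattern = [True, True, False, True]  # hypothetical acceptance pattern
    print(spd_steps(4, pattern), baseline_steps(4, len(pattern) + 1))
    print(expected_speedup(4, 0.5))
```

Under these assumptions the speedup ranges from 1 (every speculation rejected) to n (every speculation accepted), which is why the acceptance rate of the speculation module matters as much as the number of stages. Real systems also pay communication and scheduling costs that this sketch ignores.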