Speculative Pipeline Decoding: Higher Accuracy and Bubble-Free Speculation via Pipeline Parallelism

Abstract

Speculative Decoding (SD) accelerates low-concurrency LLM inference with a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which makes each further token harder to predict and adds serial drafting latency. To address these issues, we propose Speculative Pipeline Decoding (SPD), a framework that applies pipeline parallelism to the decoding of a single sequence. By partitioning the target LLM into n pipeline stages, SPD allows the LLM to process n tokens in parallel and thereby accelerates decoding. To keep the pipeline continuously filled during single-sequence decoding, a speculation module aggregates intermediate features from different pipeline depths to predict the next token. The module executes strictly in parallel with the target model's pipeline step, which yields bounded prediction difficulty, higher acceptance rates, and zero latency bubbles. Our experiments demonstrate that SPD achieves a significantly higher theoretical speedup than mainstream baselines, offering a highly scalable solution for accelerating LLM decoding. Our code is available at https://github.com/yuyijiong/speculative_pipeline_decoding.
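
To make the link between pipeline depth, acceptance rate, and theoretical speedup concrete, the following is a minimal toy model rather than the authors' implementation. It assumes that every pipeline stage costs one time step, that the speculation module adds no latency, that a correct guess lets the next token complete one step after the previous one, and that a wrong guess discards all in-flight work so the pipeline must refill. The function names `pipeline_steps` and `expected_speedup` are hypothetical and chosen only for illustration.

```python
from typing import Iterable


def pipeline_steps(n_stages: int, accepted: Iterable[bool]) -> int:
    """Count time steps needed to emit 1 + len(accepted) tokens in the toy model.

    Assumption: each stage takes one step, and speculation is free.
    """
    if n_stages < 1:
        raise ValueError("n_stages must be at least 1")
    steps = n_stages  # the first token has to traverse the whole pipeline
    for ok in accepted:
        # Correct guess: the next token is already in flight, one step behind.
        # Wrong guess: in-flight work is discarded and the pipeline refills.
        steps += 1 if ok else n_stages
    return steps


def expected_speedup(n_stages: int, accept_rate: float) -> float:
    """Steady-state speedup over plain decoding, which costs n_stages steps per token.

    Assumption: each guess is accepted independently with probability accept_rate.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must lie in [0, 1]")
    steps_per_token = accept_rate * 1 + (1.0 - accept_rate) * n_stages
    return n_stages / steps_per_token


if __name__ == "__main__":
    # Four stages, five tokens, one rejected guess in the middle.
    print(pipeline_steps(4, [True, True, False, True]))
    # Sweep the acceptance rate to see how the speedup approaches n_stages.
    for rate in (0.0, 0.5, 0.9, 1.0):
        print(rate, round(expected_speedup(4, rate), 3))
```

Under these assumptions the speedup ranges from 1 (every guess rejected) to n (every guess accepted), which is one way to read the claim that the approach scales with the number of pipeline stages as long as the acceptance rate stays high. Real measurements depend on stage imbalance, communication cost, and the actual verification rule, none of which this sketch models.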