Accelerating LLM Inference with Staged Speculative Decoding
August 8, 2023
Authors: Benjamin Spector, Chris Re
cs.AI
Abstract
Recent advances with large language models (LLMs) illustrate their diverse
capabilities. We propose a novel algorithm, staged speculative decoding, to
accelerate LLM inference in small-batch, on-device scenarios. We address the
low arithmetic intensity of small-batch inference by improving upon previous
work in speculative decoding. First, we restructure the speculative batch as a
tree, which reduces generation costs and increases the expected tokens per
batch. Second, we add a second stage of speculative decoding. Taken together,
we reduce single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L
model while perfectly preserving output quality.
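To make the two ideas concrete, below is a minimal, self-contained Python sketch of tree-structured speculation with greedy verification. The names `toy_model`, `build_tree`, and `verify` are hypothetical stand-ins, not the authors' implementation: a real system would score every node of the token tree in a single batched forward pass of the oracle model, and the sketch uses greedy decoding for simplicity where the paper's method also covers stochastic sampling.

```python
import numpy as np

VOCAB = 16  # toy vocabulary size

def toy_model(seed):
    """A stand-in 'LLM': maps a context tuple to a next-token
    distribution. Purely hypothetical, used only to show control flow."""
    local = np.random.default_rng(seed)
    cache = {}
    def next_dist(ctx):
        if ctx not in cache:
            logits = local.normal(size=VOCAB)
            cache[ctx] = np.exp(logits) / np.exp(logits).sum()
        return cache[ctx]
    return next_dist

draft, oracle = toy_model(1), toy_model(2)  # small drafter, large target

def build_tree(ctx, width=2, depth=3):
    """Draft stage: expand the `width` most likely draft tokens at each
    node, `depth` levels deep, so the speculative batch covers many
    candidate continuations instead of one linear guess."""
    paths = [()]
    for _ in range(depth):
        paths = [p + (int(t),)
                 for p in paths
                 for t in np.argsort(draft(ctx + p))[-width:]]
    return paths  # leaves; every prefix of a leaf is a tree node

def verify(ctx, leaves):
    """Verification stage: follow the oracle's greedy choice while it
    stays inside the draft tree. In a real system all tree nodes are
    scored in ONE batched oracle pass; here we emulate the
    accept/reject logic sequentially."""
    accepted = ()
    while True:
        tok = int(np.argmax(oracle(ctx + accepted)))
        cand = accepted + (tok,)
        if any(leaf[:len(cand)] == cand for leaf in leaves):
            accepted = cand           # draft token confirmed, keep going
        else:
            return accepted + (tok,)  # oracle's own token comes for free

ctx = (0,)
for _ in range(4):
    ctx += verify(ctx, build_tree(ctx))
print(ctx)  # identical to plain greedy decoding with `oracle`
```

Because every emitted token is the oracle's own greedy choice given the running context, the output matches plain oracle decoding by construction, which is the sense in which quality is preserved. The draft tree is itself generated autoregressively, so the same trick can be applied one level down, with an even smaller model drafting for the draft model; that recursion is the second stage the abstract describes.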