Accelerating LLM Inference with Staged Speculative Decoding
August 8, 2023
Authors: Benjamin Spector, Chris Re
cs.AI
Abstract
Recent advances with large language models (LLMs) illustrate their diverse
capabilities. We propose a novel algorithm, staged speculative decoding, to
accelerate LLM inference in small-batch, on-device scenarios. We address the
low arithmetic intensity of small-batch inference by improving upon previous
work in speculative decoding. First, we restructure the speculative batch as a
tree, which reduces generation costs and increases the expected tokens per
batch. Second, we add a second stage of speculative decoding. Taken together,
we reduce single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L
model while perfectly preserving output quality.
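To make the two ideas concrete, below is a minimal, self-contained Python sketch of tree-structured speculation with greedy verification. The names `toy_model`, `build_tree`, and `verify` are hypothetical stand-ins, not the authors' implementation: a real system would score every node of the token tree in a single batched forward pass of the oracle model, and the sketch uses greedy decoding for simplicity where the paper's method also covers stochastic sampling.

```python
import numpy as np

VOCAB = 16  # toy vocabulary size

def toy_model(seed):
    """A stand-in 'LLM': maps a context tuple to a next-token
    distribution. Purely hypothetical, used only to show control flow."""
    local = np.random.default_rng(seed)
    cache = {}
    def next_dist(ctx):
        if ctx not in cache:
            logits = local.normal(size=VOCAB)
            cache[ctx] = np.exp(logits) / np.exp(logits).sum()
        return cache[ctx]
    return next_dist

draft, oracle = toy_model(1), toy_model(2)  # small drafter, large target

def build_tree(ctx, width=2, depth=3):
    """Draft stage: expand the `width` most likely draft tokens at each
    node, `depth` levels deep, so the speculative batch covers many
    candidate continuations instead of one linear guess."""
    paths = [()]
    for _ in range(depth):
        paths = [p + (int(t),)
                 for p in paths
                 for t in np.argsort(draft(ctx + p))[-width:]]
    return paths  # leaves; every prefix of a leaf is a tree node

def verify(ctx, leaves):
    """Verification stage: follow the oracle's greedy choice while it
    stays inside the draft tree. In a real system all tree nodes are
    scored in ONE batched oracle pass; here we emulate the
    accept/reject logic sequentially."""
    accepted = ()
    while True:
        tok = int(np.argmax(oracle(ctx + accepted)))
        cand = accepted + (tok,)
        if any(leaf[:len(cand)] == cand for leaf in leaves):
            accepted = cand           # draft token confirmed, keep going
        else:
            return accepted + (tok,)  # oracle's own token comes for free

ctx = (0,)
for _ in range(4):
    ctx += verify(ctx, build_tree(ctx))
print(ctx)  # identical to plain greedy decoding with `oracle`
```

Because every emitted token is the oracle's own greedy choice given the running context, the output matches plain oracle decoding by construction, which is the sense in which quality is preserved. The draft tree is itself generated autoregressively, so the same trick can be applied one level down, with an even smaller model drafting for the draft model; that recursion is the second stage the abstract describes.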