Accelerating LLM Inference with Staged Speculative Decoding

August 8, 2023
Authors: Benjamin Spector, Chris Re
cs.AI

Abstract

Recent advances with large language models (LLMs) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. First, we restructure the speculative batch as a tree, which reduces generation costs and increases the expected tokens per batch. Second, we add a second stage of speculative decoding. Taken together, we reduce single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L model while perfectly preserving output quality.
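
The abstract describes two ideas: arranging the draft model's proposals as a token tree so that one pass of the large model can verify many candidate continuations at once, and adding a second speculation stage in which an even smaller model drafts for the draft model. The following is a minimal, illustrative Python sketch of the tree-structured verification step under greedy decoding, with toy next-token functions standing in for real models. The helper names (`draft_topk`, `build_tree`, `verify`) are hypothetical, not from the paper, and the paper's actual batched tree scoring and acceptance rule are more involved; the second stage is indicated only in a comment.

```python
# Illustrative sketch only: greedy decoding with toy next-token functions in
# place of real LLMs. Helper names are hypothetical, not from the paper.
from typing import List

VOCAB = 50  # toy vocabulary size


def oracle_next(ctx: List[int]) -> int:
    """Stand-in for the large target model's greedy next token."""
    return (3 * sum(ctx) + len(ctx)) % VOCAB


def draft_next(ctx: List[int]) -> int:
    """Stand-in for the small draft model: usually agrees with the oracle.
    (The paper's second stage accelerates this model the same way, with an
    even smaller model drafting for it.)"""
    t = oracle_next(ctx)
    return t if sum(ctx) % 7 else (t + 1) % VOCAB  # occasional disagreement


def draft_topk(ctx: List[int], k: int) -> List[int]:
    """Hypothetical top-k proposals from the draft model."""
    t = draft_next(ctx)
    return [(t + i) % VOCAB for i in range(k)]


def build_tree(ctx: List[int], depth: int, k: int) -> List[List[int]]:
    """Speculative token tree, branching on the draft's top-k at each level.
    Returned as the list of root-to-leaf token paths."""
    if depth == 0:
        return [[]]
    paths = []
    for tok in draft_topk(ctx, k):
        for tail in build_tree(ctx + [tok], depth - 1, k):
            paths.append([tok] + tail)
    return paths


def verify(ctx: List[int], paths: List[List[int]]) -> List[int]:
    """Accept the longest path prefix the oracle agrees with, plus one fresh
    oracle token. A real system scores the entire tree in a single batched
    forward pass of the large model rather than walking paths one by one."""
    best: List[int] = []
    for path in paths:
        accepted, cur = [], list(ctx)
        for tok in path:
            if oracle_next(cur) != tok:
                break
            accepted.append(tok)
            cur.append(tok)
        best = max(best, accepted, key=len)
    return best + [oracle_next(ctx + best)]


def generate(ctx: List[int], n_tokens: int, depth: int = 3, k: int = 2) -> List[int]:
    """Emit at least n_tokens new tokens, several per oracle call on average."""
    out = list(ctx)
    while len(out) < len(ctx) + n_tokens:
        out += verify(out, build_tree(out, depth, k))
    return out


if __name__ == "__main__":
    print(generate([1, 2, 3], 12))
```

In this sketch, branching on several draft candidates per position raises the chance that some root-to-leaf path matches the oracle, so more tokens are accepted per expensive oracle call; since tokens are only ever accepted when they match the oracle's own greedy choice, the output is identical to plain greedy decoding with the large model, mirroring the paper's claim of perfectly preserved output quality.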