

Accelerating LLM Inference with Staged Speculative Decoding

August 8, 2023
作者: Benjamin Spector, Chris Re
cs.AI

Abstract

Recent advances with large language models (LLM) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. First, we restructure the speculative batch as a tree, which reduces generation costs and increases the expected tokens per batch. Second, we add a second stage of speculative decoding. Taken together, we reduce single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L model while perfectly preserving output quality.
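The core idea of speculative decoding, which the paper builds on, can be illustrated with a toy sketch. This is an assumption-heavy illustration, not the paper's implementation: the `draft_model` and `target_model` functions below are hypothetical stand-ins (simple arithmetic over integer tokens) for a cheap draft LLM and an expensive oracle LLM. The draft proposes several tokens autoregressively; the target verifies them in one pass, keeping the longest agreeing prefix plus one corrected (or bonus) token, so output is identical to decoding with the target alone.

```python
def draft_model(context):
    # Hypothetical cheap model: predicts (last token + 1) mod 10.
    return (context[-1] + 1) % 10

def target_model(context):
    # Hypothetical expensive model: agrees with the draft except
    # after token 5, where it predicts 0 instead.
    return 0 if context[-1] == 5 else (context[-1] + 1) % 10

def speculative_step(context, k=4):
    """One speculative step: propose k draft tokens, verify with the
    target model, and return the accepted tokens."""
    # 1. Draft phase: autoregressively propose k tokens with the cheap model.
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_model(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Verify phase: the target checks each proposed position
    #    (in a real system this is a single batched forward pass).
    accepted, ctx = [], list(context)
    for t in proposed:
        correct = target_model(ctx)
        if t == correct:
            accepted.append(t)
            ctx.append(t)
        else:
            # First disagreement: take the target's token and stop.
            accepted.append(correct)
            break
    else:
        # All drafts accepted: the verify pass also yields one bonus token.
        accepted.append(target_model(ctx))
    return accepted

tokens = [3]
while len(tokens) < 10:
    tokens.extend(speculative_step(tokens))
print(tokens[:10])  # identical to greedy decoding with target_model alone
```

The paper's two contributions modify this baseline: the draft phase proposes a *tree* of candidate continuations rather than a single chain (raising expected accepted tokens per target pass), and the draft model is itself accelerated by a second, even smaller speculative stage.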