Ouroboros: Speculative Decoding with Large Model Enhanced Drafting
February 21, 2024
Authors: Weilin Zhao, Yuxiang Huang, Xu Han, Chaojun Xiao, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
Drafting-then-verifying decoding methods such as speculative decoding are
widely adopted training-free methods for accelerating the inference of large
language models (LLMs). Instead of decoding tokens sequentially with an
autoregressive process, speculative decoding first creates drafts with an
efficient small model; the LLM then verifies and corrects the drafts in a
non-autoregressive fashion to minimize time overhead. Generating longer
drafts can lead to even more significant speedups once verified, but also
incurs substantial trial-and-error costs when verification fails. Because of
the high probability of verification failure, existing decoding methods
cannot draft too much content for verification at one time, resulting in
sub-optimal inference acceleration. In this paper, we introduce Ouroboros,
which constructs a phrase candidate pool from the verification process of
the LLM to provide candidates for the small model's draft generation. In
this way, Ouroboros further improves both the efficiency and the
effectiveness of the initial drafts. Experimental results on typical text
generation tasks show that Ouroboros achieves speedups of up to 1.9x and
2.8x over lookahead decoding and speculative decoding, respectively. The
source code of Ouroboros is available at
https://github.com/thunlp/Ouroboros.
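To make the idea above concrete, here is a minimal Python sketch of the
drafting-then-verifying loop with a phrase candidate pool, in the spirit of
Ouroboros. It is a sketch under assumptions, not the released
implementation: the names ouroboros_decode, small_model_step, and
large_model_argmax are hypothetical stand-ins, and the greedy token-matching
acceptance rule simplifies the paper's actual verification algorithm.

# A minimal sketch of drafting-then-verifying with a phrase candidate
# pool, in the spirit of Ouroboros. All names are hypothetical stand-ins;
# the acceptance rule is simplified greedy matching, not the paper's
# exact verification algorithm.

def ouroboros_decode(prompt_ids, small_model_step, large_model_argmax,
                     max_new_tokens=64, draft_len=8, ngram=4):
    """Greedy speculative decoding seeded by a phrase candidate pool.

    small_model_step(tokens)      -> next token id from the small model
    large_model_argmax(tokens, d) -> len(d) + 1 greedy predictions from
                                     the large model: one per draft
                                     position plus one bonus token
    """
    tokens = list(prompt_ids)
    phrase_pool = {}  # last context token -> list of candidate phrases

    while len(tokens) - len(prompt_ids) < max_new_tokens:
        # 1) Draft: start from a pooled phrase when one matches the
        #    current context, then let the small model fill the rest.
        draft = list(phrase_pool.get(tokens[-1], [[]])[-1])
        while len(draft) < draft_len:
            draft.append(small_model_step(tokens + draft))
        draft = draft[:draft_len]

        # 2) Verify: one large-model pass scores every draft position.
        verified = large_model_argmax(tokens, draft)
        accepted = 0
        for d, v in zip(draft, verified):
            if d != v:
                break
            accepted += 1

        # 3) Accept the matching prefix plus the large model's
        #    correction (or its bonus token if everything matched).
        tokens.extend(draft[:accepted])
        tokens.append(verified[accepted])

        # 4) Harvest n-grams produced by the large model into the pool
        #    so future drafts can begin from already-verified phrases.
        for i in range(len(verified) - ngram):
            phrase_pool.setdefault(verified[i], []).append(
                verified[i + 1 : i + 1 + ngram])

    # Clip any overshoot from the last accepted block.
    return tokens[:len(prompt_ids) + max_new_tokens]

The sketch's key point is the feedback loop in step 4: tokens produced by
the large model during verification are recycled as phrase candidates, so
the next draft starts from content the large model is likely to accept.
This is what lets Ouroboros attempt longer drafts without a proportional
rise in verification failures.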