Ouroboros: Speculative Decoding with Large Model Enhanced Drafting
February 21, 2024
Authors: Weilin Zhao, Yuxiang Huang, Xu Han, Chaojun Xiao, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
Drafting-then-verifying decoding methods such as speculative decoding are
widely adopted training-free methods for accelerating the inference of large
language models (LLMs). Instead of decoding tokens sequentially with an
autoregressive process, speculative decoding first creates drafts with an
efficient small model; the LLM then verifies and corrects the drafts in a
non-autoregressive fashion to minimize time overhead. Generating longer
drafts can lead to even more significant speedups once verified, but also
incurs substantial trial-and-error costs when verification fails. Because of
the high probability of verification failure, existing decoding methods
cannot draft too much content for verification at one time, resulting in
sub-optimal inference acceleration. In this paper, we introduce Ouroboros,
which constructs a phrase candidate pool from the verification process of
the LLM to provide candidates for the small model's draft generation. In
this way, Ouroboros further improves both the efficiency and the
effectiveness of the initial drafts. Experimental results on typical text
generation tasks show that Ouroboros achieves speedups of up to 1.9x and
2.8x over lookahead decoding and speculative decoding, respectively. The
source code of Ouroboros is available at
https://github.com/thunlp/Ouroboros.
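To make the idea above concrete, here is a minimal Python sketch of the
drafting-then-verifying loop with a phrase candidate pool, in the spirit of
Ouroboros. It is a sketch under assumptions, not the released
implementation: the names ouroboros_decode, small_model_step, and
large_model_argmax are hypothetical stand-ins, and the greedy token-matching
acceptance rule simplifies the paper's actual verification algorithm.

# A minimal sketch of drafting-then-verifying with a phrase candidate
# pool, in the spirit of Ouroboros. All names are hypothetical stand-ins;
# the acceptance rule is simplified greedy matching, not the paper's
# exact verification algorithm.

def ouroboros_decode(prompt_ids, small_model_step, large_model_argmax,
                     max_new_tokens=64, draft_len=8, ngram=4):
    """Greedy speculative decoding seeded by a phrase candidate pool.

    small_model_step(tokens)      -> next token id from the small model
    large_model_argmax(tokens, d) -> len(d) + 1 greedy predictions from
                                     the large model: one per draft
                                     position plus one bonus token
    """
    tokens = list(prompt_ids)
    phrase_pool = {}  # last context token -> list of candidate phrases

    while len(tokens) - len(prompt_ids) < max_new_tokens:
        # 1) Draft: start from a pooled phrase when one matches the
        #    current context, then let the small model fill the rest.
        draft = list(phrase_pool.get(tokens[-1], [[]])[-1])
        while len(draft) < draft_len:
            draft.append(small_model_step(tokens + draft))
        draft = draft[:draft_len]

        # 2) Verify: one large-model pass scores every draft position.
        verified = large_model_argmax(tokens, draft)
        accepted = 0
        for d, v in zip(draft, verified):
            if d != v:
                break
            accepted += 1

        # 3) Accept the matching prefix plus the large model's
        #    correction (or its bonus token if everything matched).
        tokens.extend(draft[:accepted])
        tokens.append(verified[accepted])

        # 4) Harvest n-grams produced by the large model into the pool
        #    so future drafts can begin from already-verified phrases.
        for i in range(len(verified) - ngram):
            phrase_pool.setdefault(verified[i], []).append(
                verified[i + 1 : i + 1 + ngram])

    # Clip any overshoot from the last accepted block.
    return tokens[:len(prompt_ids) + max_new_tokens]

The sketch's key point is the feedback loop in step 4: tokens produced by
the large model during verification are recycled as phrase candidates, so
the next draft starts from content the large model is likely to accept.
This is what lets Ouroboros attempt longer drafts without a proportional
rise in verification failures.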