Ouroboros: Speculative Decoding with Large Model Enhanced Drafting
February 21, 2024
Authors: Weilin Zhao, Yuxiang Huang, Xu Han, Chaojun Xiao, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
Drafting-then-verifying decoding methods such as speculative decoding are
widely adopted training-free methods to accelerate the inference of large
language models (LLMs). Instead of employing an autoregressive process to
decode tokens sequentially, speculative decoding initially creates drafts with
an efficient small model. The LLM then verifies and corrects the drafts in a
non-autoregressive fashion to minimize time overhead.
Generating longer drafts can lead to even more significant speedups once
verified, but also incurs substantial trial-and-error costs if verification
fails. Because the probability of verification failure is high, existing
decoding methods cannot draft too much content for verification at one time,
and thus achieve only sub-optimal inference acceleration. In this paper, we
introduce Ouroboros,
which constructs a phrase candidate pool from the verification process of LLMs
to provide candidates for the small model's draft generation. In this way,
Ouroboros further improves both the efficiency and the effectiveness of the
initial
drafts. The experimental results on typical text generation tasks show that
Ouroboros achieves speedups of up to 1.9x and 2.8x compared to lookahead
decoding and speculative decoding, respectively. The source code of Ouroboros
is available at https://github.com/thunlp/Ouroboros.
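
To make the drafting-then-verifying loop and the phrase candidate pool concrete, here is a minimal Python sketch under stated assumptions: the names `ouroboros_style_decode`, `small_model_draft`, and `llm_verify`, the greedy n-gram pool lookup, and the toy counting models are all illustrative, not the authors' implementation (see the repository above for that).

```python
# Minimal sketch of drafting-then-verifying decoding with a phrase
# candidate pool. All names and the greedy n-gram pool lookup are
# illustrative assumptions, not the authors' implementation.
from typing import Callable, Dict, List, Tuple

Token = int
Pool = Dict[Tuple[Token, ...], List[Token]]


def ouroboros_style_decode(
    prefix: List[Token],
    small_model_draft: Callable[[List[Token], int], List[Token]],
    llm_verify: Callable[[List[Token], List[Token]],
                         Tuple[List[Token], List[List[Token]]]],
    max_new_tokens: int,
    draft_len: int,
    pool: Pool,
    ngram: int = 2,
) -> List[Token]:
    """Draft with a small model (helped by a phrase pool), then let the
    LLM verify and correct the whole draft in a single forward pass."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        # 1) Drafting: reuse a pooled phrase when the recent n-gram matches,
        #    otherwise fall back to the small model's own draft tokens.
        draft: List[Token] = []
        while len(draft) < draft_len:
            phrase = pool.get(tuple((out + draft)[-ngram:]))
            if phrase:
                draft.extend(phrase)
            else:
                draft.extend(small_model_draft(out + draft,
                                               draft_len - len(draft)))
        draft = draft[:draft_len]

        # 2) Verification: the LLM accepts a prefix of the draft (plus one
        #    corrected token) and returns discarded-but-plausible phrases,
        #    which refill the candidate pool for future drafting rounds.
        accepted, recycled = llm_verify(out, draft)
        out.extend(accepted)
        for phrase in recycled:
            if len(phrase) > ngram:
                pool[tuple(phrase[:ngram])] = phrase[ngram:]
    return out


if __name__ == "__main__":
    # Toy demo: both "models" count upward, so every draft is accepted.
    def toy_draft(ctx: List[Token], k: int) -> List[Token]:
        return [ctx[-1] + i + 1 for i in range(k)]

    def toy_verify(ctx: List[Token], draft: List[Token]):
        accepted, expected = [], ctx[-1] + 1
        for tok in draft:
            if tok != expected:
                accepted.append(expected)  # correct the first mismatch
                break
            accepted.append(tok)
            expected += 1
        else:
            accepted.append(expected)      # free extra token on full accept
        return accepted, [draft[len(accepted):]]

    print(ouroboros_style_decode([0], toy_draft, toy_verify,
                                 max_new_tokens=8, draft_len=4, pool={}))
```

The point the sketch captures is the feedback loop named in the abstract: tokens the LLM produces during verification but that fall outside the accepted draft are not wasted, they re-enter the pool as phrase candidates, which is what allows longer drafts without paying the full trial-and-error cost of a cold small-model draft.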