Ouroboros: Speculative Decoding with Large Model Enhanced Drafting
February 21, 2024
Authors: Weilin Zhao, Yuxiang Huang, Xu Han, Chaojun Xiao, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
Drafting-then-verifying decoding methods such as speculative decoding are
widely adopted training-free methods to accelerate the inference of large
language models (LLMs). Instead of employing an autoregressive process to
decode tokens sequentially, speculative decoding initially creates drafts with
an efficient small model. The LLM then verifies and corrects the drafts in a
non-autoregressive fashion to minimize time overhead.
Generating longer drafts can lead to even more significant speedups once
verified, but also incurs substantial trial-and-error costs if verification
fails. Because the probability of verification failure is high, existing
decoding methods cannot draft too much content for verification at one time,
and thus achieve only sub-optimal inference acceleration. In this paper, we
introduce Ouroboros,
which constructs a phrase candidate pool from the verification process of LLMs
to provide candidates for the small model's draft generation. In this way,
Ouroboros further improves both the efficiency and the effectiveness of the
initial
drafts. The experimental results on typical text generation tasks show that
Ouroboros achieves speedups of up to 1.9x and 2.8x compared to lookahead
decoding and speculative decoding, respectively. The source code of Ouroboros
is available at https://github.com/thunlp/Ouroboros.
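
To make the drafting-then-verifying loop and the phrase candidate pool concrete, here is a minimal Python sketch under stated assumptions: the names `ouroboros_style_decode`, `small_model_draft`, and `llm_verify`, the greedy n-gram pool lookup, and the toy counting models are all illustrative, not the authors' implementation (see the repository above for that).

```python
# Minimal sketch of drafting-then-verifying decoding with a phrase
# candidate pool. All names and the greedy n-gram pool lookup are
# illustrative assumptions, not the authors' implementation.
from typing import Callable, Dict, List, Tuple

Token = int
Pool = Dict[Tuple[Token, ...], List[Token]]


def ouroboros_style_decode(
    prefix: List[Token],
    small_model_draft: Callable[[List[Token], int], List[Token]],
    llm_verify: Callable[[List[Token], List[Token]],
                         Tuple[List[Token], List[List[Token]]]],
    max_new_tokens: int,
    draft_len: int,
    pool: Pool,
    ngram: int = 2,
) -> List[Token]:
    """Draft with a small model (helped by a phrase pool), then let the
    LLM verify and correct the whole draft in a single forward pass."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        # 1) Drafting: reuse a pooled phrase when the recent n-gram matches,
        #    otherwise fall back to the small model's own draft tokens.
        draft: List[Token] = []
        while len(draft) < draft_len:
            phrase = pool.get(tuple((out + draft)[-ngram:]))
            if phrase:
                draft.extend(phrase)
            else:
                draft.extend(small_model_draft(out + draft,
                                               draft_len - len(draft)))
        draft = draft[:draft_len]

        # 2) Verification: the LLM accepts a prefix of the draft (plus one
        #    corrected token) and returns discarded-but-plausible phrases,
        #    which refill the candidate pool for future drafting rounds.
        accepted, recycled = llm_verify(out, draft)
        out.extend(accepted)
        for phrase in recycled:
            if len(phrase) > ngram:
                pool[tuple(phrase[:ngram])] = phrase[ngram:]
    return out


if __name__ == "__main__":
    # Toy demo: both "models" count upward, so every draft is accepted.
    def toy_draft(ctx: List[Token], k: int) -> List[Token]:
        return [ctx[-1] + i + 1 for i in range(k)]

    def toy_verify(ctx: List[Token], draft: List[Token]):
        accepted, expected = [], ctx[-1] + 1
        for tok in draft:
            if tok != expected:
                accepted.append(expected)  # correct the first mismatch
                break
            accepted.append(tok)
            expected += 1
        else:
            accepted.append(expected)      # free extra token on full accept
        return accepted, [draft[len(accepted):]]

    print(ouroboros_style_decode([0], toy_draft, toy_verify,
                                 max_new_tokens=8, draft_len=4, pool={}))
```

The point the sketch captures is the feedback loop named in the abstract: tokens the LLM produces during verification but that fall outside the accepted draft are not wasted, they re-enter the pool as phrase candidates, which is what allows longer drafts without paying the full trial-and-error cost of a cold small-model draft.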