Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting

Abstract

Speculative decoding has proven effective at accelerating the inference of large language models while preserving a consistent sampling distribution. However, the conventional approach of training a separate draft model to reach a satisfactory token acceptance rate can be costly. Drawing inspiration from early exiting, we propose Kangaroo, a novel self-speculative decoding framework that uses a fixed shallow sub-network as the self-draft model, with the remaining layers serving as the larger target model. We train a lightweight and efficient adapter module on top of the sub-network to bridge the gap between the representation ability of the sub-network and that of the full model. Notably, the inference latency of the self-draft model may no longer be negligible compared to the large model, which calls for strategies that increase the token acceptance rate while minimizing the drafting steps of the small model. To address this challenge, we introduce an additional early exiting mechanism for generating draft tokens: during the drafting phase, we halt the small model's subsequent prediction once the confidence level for the current token falls below a certain threshold. Extensive experiments on Spec-Bench demonstrate the effectiveness of Kangaroo. Under single-sequence verification, Kangaroo achieves speedups of up to 1.68× on Spec-Bench, outperforming Medusa-1 with 88.7% fewer additional parameters (67M compared to 591M). The code for Kangaroo is available at https://github.com/Equationliu/Kangaroo.
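The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. It is not the Kangaroo implementation: `draft_fn` is a hypothetical stand-in for the shallow sub-network plus adapter, `target_fn` for the full model, and `toy_draft`, `toy_target`, `max_steps`, and `threshold` are illustrative names and values. The sketch assumes greedy decoding, assumes the low-confidence token is still proposed before drafting stops, and calls the target once per position for readability, whereas a real system would verify all drafted tokens in a single parallel forward pass.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces, not part of the Kangaroo codebase:
# draft_fn returns (top-1 token, its confidence); target_fn returns the greedy token.
DraftFn = Callable[[Sequence[int]], Tuple[int, float]]
TargetFn = Callable[[Sequence[int]], int]


def draft_with_early_exit(prefix: Sequence[int], draft_fn: DraftFn,
                          max_steps: int, threshold: float) -> List[int]:
    """Draft up to max_steps tokens, halting once confidence drops below threshold."""
    context = list(prefix)
    drafted: List[int] = []
    for _ in range(max_steps):
        token, confidence = draft_fn(context)
        drafted.append(token)  # assumption: the current token is still proposed
        context.append(token)
        if confidence < threshold:
            break  # second early exit: stop any further drafting
    return drafted


def verify_greedy(prefix: Sequence[int], drafted: Sequence[int],
                  target_fn: TargetFn) -> List[int]:
    """Accept drafted tokens until the first mismatch with the target's greedy choice."""
    context = list(prefix)
    accepted: List[int] = []
    for token in drafted:
        expected = target_fn(context)
        if token != expected:
            accepted.append(expected)  # replace the rejected token with the target's
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_fn(context))  # every draft accepted: one extra target token
    return accepted


def generate(prefix: Sequence[int], draft_fn: DraftFn, target_fn: TargetFn,
             num_tokens: int, max_steps: int = 6, threshold: float = 0.6) -> List[int]:
    """Alternate drafting and verification until num_tokens new tokens exist."""
    output = list(prefix)
    while len(output) - len(prefix) < num_tokens:
        drafted = draft_with_early_exit(output, draft_fn, max_steps, threshold)
        output.extend(verify_greedy(output, drafted, target_fn))
    return output[: len(prefix) + num_tokens]


def toy_target(context: Sequence[int]) -> int:
    """Toy target model: always predicts the next integer modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: Sequence[int]) -> Tuple[int, float]:
    """Toy draft model: agrees with the target except after tokens 3 and 7."""
    if context[-1] % 4 == 3:
        return 0, 0.3  # wrong guess, reported with low confidence
    return (context[-1] + 1) % 10, 0.9


if __name__ == "__main__":
    print(generate([0], toy_draft, toy_target, num_tokens=8))
```

Because every emitted token is either confirmed or supplied by `target_fn`, the output matches what the target alone would produce under greedy decoding; the confidence threshold only changes how many draft steps are spent per verification round.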
