Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting
April 29, 2024
Authors: Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang
cs.AI
Abstract
Speculative decoding has demonstrated its effectiveness in accelerating the
inference of large language models while maintaining a consistent sampling
distribution. However, the conventional approach of training a separate draft
model to achieve a satisfactory token acceptance rate can be costly. Drawing
inspiration from early exiting, we propose a novel self-speculative decoding
framework Kangaroo, which uses a fixed shallow sub-network as a
self-draft model, with the remaining layers serving as the larger target model.
We train a lightweight and efficient adapter module on top of the sub-network
to bridge the gap between the sub-network and the full model's representation
ability. It is noteworthy that the inference latency of the self-draft model
may no longer be negligible compared to the large model, necessitating
strategies to increase the token acceptance rate while minimizing the drafting
steps of the small model. To address this challenge, we introduce an additional
early exiting mechanism for generating draft tokens. Specifically, we halt the
small model's subsequent prediction during the drafting phase once the
confidence level for the current token falls below a certain threshold.
Extensive experiments on the Spec-Bench demonstrate the effectiveness of
Kangaroo. Under single-sequence verification, Kangaroo achieves speedups up to
1.68times on Spec-Bench, outperforming Medusa-1 with 88.7\% fewer additional
parameters (67M compared to 591M). The code for Kangaroo is available at
https://github.com/Equationliu/Kangaroo.Summary
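The double-early-exit idea described in the abstract can be made concrete with a small sketch. Below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: a toy decoder whose first layers double as the self-draft model, a lightweight adapter bridging those shallow hidden states to the shared LM head, and a drafting loop that stops as soon as the draft token's confidence falls below a threshold, after which the full model verifies the drafted prefix. The names (ToyLM, draft_then_verify), the layer split, and the threshold value are illustrative assumptions; causal masking, KV caching, and the sampling-based verification used in practice are omitted for brevity.

```python
import torch
import torch.nn as nn

class ToyLM(nn.Module):
    """Toy decoder-only LM; its first `split` layers double as the self-draft sub-network."""
    def __init__(self, vocab=100, dim=32, n_layers=4, split=1):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(n_layers)]
        )
        # Lightweight adapter bridging the shallow hidden states to the shared LM head.
        self.adapter = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, vocab)
        self.split = split

    def shallow(self, ids):
        h = self.embed(ids)
        for blk in self.blocks[: self.split]:
            h = blk(h)
        return h

    def deep(self, h):
        for blk in self.blocks[self.split :]:
            h = blk(h)
        return self.head(h)

    def draft_logits(self, ids):
        # Self-draft model = shared shallow layers + adapter + shared LM head.
        return self.head(self.adapter(self.shallow(ids)))

@torch.no_grad()
def draft_then_verify(model, ids, max_draft=6, conf_threshold=0.6):
    """Greedily draft with the shallow model, halting early on low confidence,
    then verify the drafted tokens with the full model in a single pass."""
    drafted, cur = [], ids
    for _ in range(max_draft):
        probs = torch.softmax(model.draft_logits(cur)[:, -1], dim=-1)
        conf, tok = probs.max(dim=-1)
        drafted.append(tok)
        cur = torch.cat([cur, tok[:, None]], dim=1)
        if conf.item() < conf_threshold:  # second early exit: stop drafting here
            break
    # Verification: run the full model (shallow + deep layers) over the extended
    # sequence and accept the longest drafted prefix matching its greedy choices.
    full_pred = model.deep(model.shallow(cur)).argmax(dim=-1)
    accepted = 0
    for i, tok in enumerate(drafted):
        if full_pred[0, ids.shape[1] - 1 + i].item() == tok.item():
            accepted += 1
        else:
            break
    return cur[:, : ids.shape[1] + accepted], accepted

model = ToyLM().eval()
prompt = torch.randint(0, 100, (1, 5))
seq, n_accepted = draft_then_verify(model, prompt)
print(f"accepted {n_accepted} draft tokens; sequence length is now {seq.shape[1]}")
```

In this sketch the drafting cost is shared with the target model because the shallow layers' hidden states are reused during verification, which is what makes the confidence-based stopping rule worthwhile: it spends draft steps only while the small model is likely to be accepted.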