CLaSp: In-Context Layer Skip for Self-Speculative Decoding
May 30, 2025
作者: Longze Chen, Renke Shan, Huiming Wang, Lu Wang, Ziqiang Liu, Run Luo, Jiawei Wang, Hamid Alinejad-Rokny, Min Yang
cs.AI
Abstract
Speculative decoding (SD) is a promising method for accelerating the decoding process of Large Language Models (LLMs). The efficiency of SD primarily hinges on the consistency between the draft model and the verification model. However, existing drafting approaches typically require additional modules to be trained, which can be challenging to implement and to keep compatible across various LLMs. In this paper, we propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp requires neither additional drafting modules nor extra training. Instead, it employs a plug-and-play mechanism that skips intermediate layers of the verification model to construct a compressed draft model. Specifically, we develop a dynamic programming algorithm that optimizes the layer-skipping process by using the complete hidden states from the last verification stage as its objective. This enables CLaSp to dynamically adjust its layer-skipping strategy after each verification stage, without relying on pre-optimized sets of skipped layers. Experimental results across diverse downstream tasks demonstrate that CLaSp achieves a speedup of 1.3x to 1.7x on LLaMA3 series models without altering the original distribution of the generated text.
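
The abstract describes a dynamic programming procedure that chooses which decoder layers to skip so that the layer-skipped draft pass stays close to the full model's hidden states cached during the last verification step. The sketch below is a minimal illustration of that idea, not the authors' implementation: it assumes layers are plain callables mapping a hidden state to a hidden state, scores each step with cosine similarity against the cached per-layer hidden states, and keeps only one candidate state per (layer, skip-count) cell, which is a simplifying heuristic.

```python
import torch
import torch.nn.functional as F

def select_skipped_layers(layers, cached_states, num_skip):
    """Pick `num_skip` layers to skip for the next draft pass (illustrative sketch).

    layers:        decoder layers, treated here as callables hidden -> hidden
                   (a hypothetical simplification; real layers also need masks, caches, etc.)
    cached_states: hidden states recorded during the last full verification pass;
                   cached_states[i] is the input to layer i, and cached_states[len(layers)]
                   is the full model's final hidden state.
    num_skip:      how many layers the draft pass should drop.
    """
    L = len(layers)
    NEG = float("-inf")
    # dp[i][k]: (accumulated similarity score, hidden state) after the first i
    # layers with exactly k of them skipped. Keeping a single state per cell is
    # an approximation, since skipping earlier layers changes the inputs that
    # later layers actually see.
    dp = [[(NEG, None) for _ in range(num_skip + 1)] for _ in range(L + 1)]
    dp[0][0] = (0.0, cached_states[0])
    back = [[None for _ in range(num_skip + 1)] for _ in range(L + 1)]

    for i in range(L):
        for k in range(num_skip + 1):
            score, h = dp[i][k]
            if h is None:
                continue
            # Option 1: execute layer i and score against the cached reference state.
            h_run = layers[i](h)
            sim = F.cosine_similarity(h_run.flatten(),
                                      cached_states[i + 1].flatten(), dim=0).item()
            if score + sim > dp[i + 1][k][0]:
                dp[i + 1][k] = (score + sim, h_run)
                back[i + 1][k] = (k, False)
            # Option 2: skip layer i (identity connection), if skips remain.
            if k < num_skip:
                sim = F.cosine_similarity(h.flatten(),
                                          cached_states[i + 1].flatten(), dim=0).item()
                if score + sim > dp[i + 1][k + 1][0]:
                    dp[i + 1][k + 1] = (score + sim, h)
                    back[i + 1][k + 1] = (k, True)

    # Backtrack the best path to recover the set of skipped layer indices.
    skipped, k = [], num_skip
    for i in range(L, 0, -1):
        k, was_skipped = back[i][k]
        if was_skipped:
            skipped.append(i - 1)
    return sorted(skipped)
```

After each verification step, the skipped-layer set would be refreshed from the newly cached hidden states, which corresponds to the per-step, in-context adjustment the abstract describes.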