CLaSp: In-Context Layer Skip for Self-Speculative Decoding
May 30, 2025
作者: Longze Chen, Renke Shan, Huiming Wang, Lu Wang, Ziqiang Liu, Run Luo, Jiawei Wang, Hamid Alinejad-Rokny, Min Yang
cs.AI
Abstract
Speculative decoding (SD) is a promising method for accelerating the decoding process of Large Language Models (LLMs). The efficiency of SD primarily hinges on the consistency between the draft model and the verification model. However, existing drafting approaches typically require additional modules to be trained, which can be challenging to implement and to keep compatible across various LLMs. In this paper, we propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp requires neither additional drafting modules nor extra training. Instead, it employs a plug-and-play mechanism that skips intermediate layers of the verification model to construct a compressed draft model. Specifically, we develop a dynamic programming algorithm that optimizes the layer-skipping process by using the complete hidden states from the last verification stage as its objective. This enables CLaSp to dynamically adjust its layer-skipping strategy after each verification stage, without relying on pre-optimized sets of skipped layers. Experimental results across diverse downstream tasks demonstrate that CLaSp achieves a speedup of 1.3x to 1.7x on LLaMA3 series models without altering the original distribution of the generated text.
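
The abstract describes a dynamic programming procedure that chooses which decoder layers to skip so that the layer-skipped draft pass stays close to the full model's hidden states cached during the last verification step. The sketch below is a minimal illustration of that idea, not the authors' implementation: it assumes layers are plain callables mapping a hidden state to a hidden state, scores each step with cosine similarity against the cached per-layer hidden states, and keeps only one candidate state per (layer, skip-count) cell, which is a simplifying heuristic.

```python
import torch
import torch.nn.functional as F

def select_skipped_layers(layers, cached_states, num_skip):
    """Pick `num_skip` layers to skip for the next draft pass (illustrative sketch).

    layers:        decoder layers, treated here as callables hidden -> hidden
                   (a hypothetical simplification; real layers also need masks, caches, etc.)
    cached_states: hidden states recorded during the last full verification pass;
                   cached_states[i] is the input to layer i, and cached_states[len(layers)]
                   is the full model's final hidden state.
    num_skip:      how many layers the draft pass should drop.
    """
    L = len(layers)
    NEG = float("-inf")
    # dp[i][k]: (accumulated similarity score, hidden state) after the first i
    # layers with exactly k of them skipped. Keeping a single state per cell is
    # an approximation, since skipping earlier layers changes the inputs that
    # later layers actually see.
    dp = [[(NEG, None) for _ in range(num_skip + 1)] for _ in range(L + 1)]
    dp[0][0] = (0.0, cached_states[0])
    back = [[None for _ in range(num_skip + 1)] for _ in range(L + 1)]

    for i in range(L):
        for k in range(num_skip + 1):
            score, h = dp[i][k]
            if h is None:
                continue
            # Option 1: execute layer i and score against the cached reference state.
            h_run = layers[i](h)
            sim = F.cosine_similarity(h_run.flatten(),
                                      cached_states[i + 1].flatten(), dim=0).item()
            if score + sim > dp[i + 1][k][0]:
                dp[i + 1][k] = (score + sim, h_run)
                back[i + 1][k] = (k, False)
            # Option 2: skip layer i (identity connection), if skips remain.
            if k < num_skip:
                sim = F.cosine_similarity(h.flatten(),
                                          cached_states[i + 1].flatten(), dim=0).item()
                if score + sim > dp[i + 1][k + 1][0]:
                    dp[i + 1][k + 1] = (score + sim, h)
                    back[i + 1][k + 1] = (k, True)

    # Backtrack the best path to recover the set of skipped layer indices.
    skipped, k = [], num_skip
    for i in range(L, 0, -1):
        k, was_skipped = back[i][k]
        if was_skipped:
            skipped.append(i - 1)
    return sorted(skipped)
```

After each verification step, the skipped-layer set would be refreshed from the newly cached hidden states, which corresponds to the per-step, in-context adjustment the abstract describes.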