CLaSp: In-Context Layer Skip for Self-Speculative Decoding
May 30, 2025
Authors: Longze Chen, Renke Shan, Huiming Wang, Lu Wang, Ziqiang Liu, Run Luo, Jiawei Wang, Hamid Alinejad-Rokny, Min Yang
cs.AI
Abstract
Speculative decoding (SD) is a promising method for accelerating the decoding process of Large Language Models (LLMs). The efficiency of SD hinges primarily on the consistency between the draft model and the verification model. However, existing drafting approaches typically require additional modules to be trained, which can be challenging to implement and difficult to keep compatible across various LLMs. In this paper, we propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp requires no additional drafting modules and no extra training. Instead, it employs a plug-and-play mechanism that skips intermediate layers of the verification model to construct a compressed draft model. Specifically, we develop a dynamic programming algorithm that optimizes the layer-skipping process by using the complete hidden states from the last verification stage as its objective. This enables CLaSp to dynamically adjust its layer-skipping strategy after each verification stage, without relying on a pre-optimized set of skipped layers. Experimental results across diverse downstream tasks demonstrate that CLaSp achieves a speedup of 1.3x to 1.7x on LLaMA3-series models without altering the original distribution of the generated text.
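To make the abstract's central idea concrete, the sketch below illustrates one way a dynamic program could choose which decoder layers to skip by comparing draft hidden states against the full hidden states cached from the last verification stage. This is a minimal toy illustration under stated assumptions, not the authors' implementation: the layer functions are random maps, and all names (apply_layer, SKIP_BUDGET, h_full) and sizes are hypothetical stand-ins.

```python
# Toy sketch of in-context layer-skip selection via dynamic programming.
# Assumption: we have the per-layer hidden states of the full model at the
# last verified token, and we pick a skip pattern whose draft hidden states
# stay close (in cosine similarity) to those full hidden states.
import torch

torch.manual_seed(0)

HIDDEN = 64          # toy hidden size (assumption)
NUM_LAYERS = 8       # toy number of decoder layers (assumption)
SKIP_BUDGET = 3      # how many layers the draft pass may skip (assumption)

# Stand-ins for decoder layers: fixed random linear maps with a nonlinearity.
# In a real model these would be the transformer blocks.
layers = [torch.randn(HIDDEN, HIDDEN) / HIDDEN**0.5 for _ in range(NUM_LAYERS)]

def apply_layer(layer, h):
    """Apply one (toy) decoder layer to a hidden state."""
    return torch.tanh(h @ layer)

def cosine(a, b):
    """Cosine similarity between two 1-D hidden states."""
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

# Hidden states of the full (dense) model at the last verified token,
# one per layer boundary, as if cached during the previous verification stage.
h_full = [torch.randn(HIDDEN)]
for layer in layers:
    h_full.append(apply_layer(layer, h_full[-1]))

# DP state: best[j] = (draft hidden state, skipped-layer list) that is most
# similar to the full hidden state at the current depth, given j skips so far.
best = {0: (h_full[0], [])}
for l, layer in enumerate(layers):
    new_best = {}
    for j, (h, skipped) in best.items():
        # Option 1: run layer l on the current draft state.
        run_h = apply_layer(layer, h)
        if j not in new_best or cosine(run_h, h_full[l + 1]) > cosine(new_best[j][0], h_full[l + 1]):
            new_best[j] = (run_h, skipped)
        # Option 2: skip layer l (the state passes through unchanged).
        if j + 1 <= SKIP_BUDGET:
            if j + 1 not in new_best or cosine(h, h_full[l + 1]) > cosine(new_best[j + 1][0], h_full[l + 1]):
                new_best[j + 1] = (h, skipped + [l])
    best = new_best

# The skip set using the largest available budget becomes the draft
# configuration for the next drafting phase.
final_h, skip_set = best[max(best)]
print("layers to skip for the next drafting phase:", skip_set)
print("similarity to the full last hidden state:", round(cosine(final_h, h_full[-1]), 4))
```

In this reading, the skip pattern is re-derived after every verification stage from states the verifier has already computed, which is why no separate draft module or extra training is needed; the actual paper's dynamic program and similarity objective may differ in detail.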