CLaSp: In-Context Layer Skip for Self-Speculative Decoding

Abstract

Speculative decoding (SD) is a promising method for accelerating the decoding process of large language models (LLMs). The efficiency of SD depends primarily on how consistent the draft model is with the verify model. However, existing drafting approaches typically require training additional modules, which can be difficult to implement and makes it hard to ensure compatibility across different LLMs. In this paper, we propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp requires neither additional drafting modules nor extra training. Instead, it employs a plug-and-play mechanism that skips intermediate layers of the verify model to construct a compressed draft model. Specifically, we develop a dynamic programming algorithm that optimizes the layer-skipping process by using the complete hidden states from the last verification stage as its objective. This enables CLaSp to dynamically adjust its layer-skipping strategy after each verification stage, without relying on pre-optimized sets of skipped layers. Experimental results across diverse downstream tasks demonstrate that CLaSp achieves a speedup of 1.3x to 1.7x on LLaMA3 series models without altering the original distribution of the generated text.
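
The layer-selection idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract says only that a dynamic programming algorithm uses the complete hidden states from the last verification stage as its objective. The sketch assumes cosine similarity as the closeness measure, a fixed skip budget `num_skip`, and toy "layers" that are plain Python functions on tuples. The names `select_skipped_layers`, `run_draft`, and `toy_layers` are hypothetical. For each layer depth and skip count it keeps the single candidate hidden state that is closest to the full model's hidden state at that depth.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_skipped_layers(layers, x0, num_skip):
    """Choose `num_skip` layers to skip; return (skipped indices, reference states).

    Assumption: the reference states stand in for the complete hidden states
    kept from the last verification pass of the full model.
    """
    if not 0 <= num_skip <= len(layers):
        raise ValueError("num_skip must be between 0 and the number of layers")

    # Hidden state after every layer of the full (verify) model.
    ref = [x0]
    for layer in layers:
        ref.append(layer(ref[-1]))

    # best[j] = (hidden state, skipped indices) after the layers seen so far,
    # with exactly j of them skipped.
    best = {0: (x0, [])}
    for i, layer in enumerate(layers, start=1):
        nxt = {}
        for j in range(min(i, num_skip) + 1):
            candidates = []
            if j in best:  # option 1: run layer i-1
                h, skipped = best[j]
                candidates.append((layer(h), skipped))
            if j - 1 in best:  # option 2: skip layer i-1
                h, skipped = best[j - 1]
                candidates.append((h, skipped + [i - 1]))
            # Keep the candidate closest to the full model's state at depth i.
            nxt[j] = max(candidates, key=lambda c: cosine(c[0], ref[i]))
        best = nxt
    return best[num_skip][1], ref


def run_draft(layers, x0, skipped):
    """Run the compressed draft model: every layer except the skipped ones."""
    h = x0
    for idx, layer in enumerate(layers):
        if idx not in skipped:
            h = layer(h)
    return h


# Toy 2-D "layers" (illustrative only): two change the state, two are no-ops.
toy_layers = [
    lambda h: (-h[1], h[0]),        # rotate by 90 degrees
    lambda h: h,                    # no-op
    lambda h: (h[0] + h[1], h[1]),  # shear
    lambda h: h,                    # no-op
]

skipped_layers, reference = select_skipped_layers(toy_layers, (1.0, 0.0), 2)
print(skipped_layers)  # -> [1, 3]
```

In this toy setting the two no-op layers are selected, so the draft output equals the full model's output. Because only one candidate is kept per cell, the selection is an approximation, not an exhaustive search over all skip sets.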
