Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting

Abstract

Speculative decoding has demonstrated its effectiveness in accelerating the inference of large language models while maintaining a consistent sampling distribution. However, the conventional approach of training a separate draft model to achieve a satisfactory token acceptance rate can be costly. Drawing inspiration from early exiting, we propose a novel self-speculative decoding framework, Kangaroo, which uses a fixed shallow sub-network as a self-draft model, with the remaining layers serving as the larger target model. We train a lightweight and efficient adapter module on top of the sub-network to bridge the gap between the representation ability of the sub-network and that of the full model. It is noteworthy that the inference latency of the self-draft model may no longer be negligible compared to the large model, necessitating strategies to increase the token acceptance rate while minimizing the drafting steps of the small model. To address this challenge, we introduce an additional early exiting mechanism for generating draft tokens. Specifically, we halt the small model's subsequent prediction during the drafting phase once the confidence level for the current token falls below a certain threshold. Extensive experiments on Spec-Bench demonstrate the effectiveness of Kangaroo. Under single-sequence verification, Kangaroo achieves speedups of up to 1.68× on Spec-Bench, outperforming Medusa-1 with 88.7% fewer additional parameters (67M compared to 591M). The code for Kangaroo is available at https://github.com/Equationliu/Kangaroo.
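
The draft-then-verify loop with a confidence-based early exit can be illustrated with a small sketch. This is not the Kangaroo implementation: the function names (`draft_with_early_exit`, `verify_greedy`, `generate`), the toy models, the default threshold of 0.6, and the cap of 6 drafting steps are all hypothetical choices made for illustration. The sketch also assumes greedy decoding, treats the draft and target models as two independent callables (in Kangaroo the draft model is a shallow sub-network of the target plus an adapter), discards the low-confidence token rather than keeping it, and verifies drafted tokens one at a time instead of in a single parallel forward pass.

```python
from typing import Callable, List, Sequence

# A "model" here is any callable mapping a token context to next-token probabilities.
NextTokenProbs = Callable[[Sequence[int]], List[float]]

VOCAB = 5  # toy vocabulary size (assumption for this sketch)


def argmax(probs: Sequence[float]) -> int:
    return max(range(len(probs)), key=probs.__getitem__)


def draft_with_early_exit(draft_next: NextTokenProbs, prefix: Sequence[int],
                          max_steps: int, threshold: float) -> List[int]:
    """Draft greedily, stopping once the top-1 confidence drops below threshold."""
    context, drafted = list(prefix), []
    for _ in range(max_steps):
        probs = draft_next(context)
        token = argmax(probs)
        if probs[token] < threshold:  # early exit: the draft model is unsure
            break
        drafted.append(token)
        context.append(token)
    return drafted


def verify_greedy(target_next: NextTokenProbs, prefix: Sequence[int],
                  drafted: Sequence[int]) -> List[int]:
    """Keep drafted tokens while they match the target's greedy choice.

    Always returns at least one token: the target's correction at the first
    mismatch, or one extra target token if every drafted token was accepted.
    """
    context, accepted = list(prefix), []
    for token in drafted:
        target_token = argmax(target_next(context))
        if target_token != token:
            accepted.append(target_token)
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(argmax(target_next(context)))
    return accepted


def generate(draft_next: NextTokenProbs, target_next: NextTokenProbs,
             prompt: Sequence[int], num_tokens: int,
             max_steps: int = 6, threshold: float = 0.6) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        drafted = draft_with_early_exit(draft_next, out, max_steps, threshold)
        out.extend(verify_greedy(target_next, out, drafted))
    return out[: len(prompt) + num_tokens]


def greedy_decode(target_next: NextTokenProbs, prompt: Sequence[int],
                  num_tokens: int) -> List[int]:
    """Reference: plain greedy decoding with the target model only."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(argmax(target_next(out)))
    return out


def peaked(token: int, confidence: float) -> List[float]:
    rest = (1.0 - confidence) / (VOCAB - 1)
    return [confidence if i == token else rest for i in range(VOCAB)]


def toy_target(context: Sequence[int]) -> List[float]:
    # Toy target: always predicts (last + 1) mod VOCAB.
    return peaked((context[-1] + 1) % VOCAB, 0.9)


def toy_draft(context: Sequence[int]) -> List[float]:
    # Toy draft: agrees with the target, except it is unsure (and wrong) after token 3.
    if context[-1] == 3:
        return peaked(0, 0.3)
    return peaked((context[-1] + 1) % VOCAB, 0.8)


if __name__ == "__main__":
    print(generate(toy_draft, toy_target, [0], 8))
```

In this toy setup the draft model stops drafting whenever it reaches the context where it is unsure, so no drafting step is spent on a token that would likely be rejected, and the output matches plain greedy decoding with the target model, which is the sense in which the scheme is lossless under the greedy assumption.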
