AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

Abstract

Speculative Decoding (SD) accelerates large language model inference by having a small draft model generate candidate tokens, which a larger target model then verifies. The effectiveness of SD hinges on how well the two models align, and Knowledge Distillation (KD) is typically used to improve that alignment. However, conventional KD methods minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD: maximizing the token acceptance rate. As a result, draft models, limited by their capacity, often struggle to fully assimilate the target model's knowledge, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC uses a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that aligns better with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results show that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15%). The code is publicly available at https://github.com/yuezhouhu/adaspec.
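The idea of filtering tokens before distillation can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the selection rule, so the scoring used here (KL divergence from the target to the reference model, keeping the lowest-scoring fraction of positions) and the names `select_tokens`, `selective_kd_loss`, and `keep_ratio` are assumptions made for illustration. The helper `expected_acceptance` uses the standard speculative sampling rule, under which a drafted token x is accepted with probability min(1, p(x)/q(x)), so the per-position acceptance probability is the sum over x of min(p(x), q(x)).

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def select_tokens(target_probs, reference_probs, keep_ratio=0.5):
    """Return the sorted positions kept for distillation.

    Assumption (not from the paper's abstract): a position counts as
    hard to fit when the reference model's distribution is far from the
    target's, and the keep_ratio fraction with the smallest gap is kept.
    """
    scores = [kl_divergence(t, r) for t, r in zip(target_probs, reference_probs)]
    n_keep = max(1, int(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n_keep])


def selective_kd_loss(target_probs, draft_probs, kept):
    """Mean KL(target || draft) over the kept positions only."""
    return sum(kl_divergence(target_probs[i], draft_probs[i]) for i in kept) / len(kept)


def expected_acceptance(target_p, draft_q):
    """Acceptance probability at one position under standard speculative sampling."""
    return sum(min(p, q) for p, q in zip(target_p, draft_q))
```

In this sketch, each argument is a list of per-position next-token distributions over the same vocabulary. Conventional KD corresponds to passing every position as `kept`; the selective variant averages the loss over the filtered subset only, so the draft model's limited capacity is spent on positions it can plausibly match.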
