AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
October 22, 2025
Authors: Yuezhou Hu, Jiaxin Guo, Xinyu Feng, Tuo Zhao
cs.AI
Abstract
Speculative Decoding (SD) accelerates large language model inference by
employing a small draft model to generate predictions, which are then verified
by a larger target model. The effectiveness of SD hinges on the alignment
between these models, which is typically enhanced by Knowledge Distillation
(KD). However, conventional KD methods aim to minimize the KL divergence
between the draft and target models across all tokens, a goal that is
misaligned with the true objective of SD, which is to maximize token acceptance
rate. Therefore, draft models often struggle to fully assimilate the target
model's knowledge due to capacity constraints, leading to suboptimal
performance. To address this challenge, we propose AdaSPEC, a novel method that
incorporates selective token filtering into the KD process. AdaSPEC utilizes a
reference model to identify and filter out difficult-to-fit tokens, enabling
the distillation of a draft model that better aligns with the target model on
simpler tokens. This approach improves the overall token acceptance rate
without compromising generation quality. We evaluate AdaSPEC across diverse
tasks, including arithmetic reasoning, instruction-following, coding, and
summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters.
Our results demonstrate that AdaSPEC consistently outperforms the
state-of-the-art DistillSpec method, achieving higher acceptance rates across
all tasks (up to 15%). The code is publicly available at
https://github.com/yuezhouhu/adaspec.
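
The abstract refers to the token acceptance rate of speculative decoding without spelling out the verification step. The sketch below illustrates the standard speculative sampling accept/reject rule that this rate measures; the abstract does not describe the authors' exact verification procedure, so the function and variable names here are illustrative assumptions.

```python
# Minimal sketch of the standard speculative decoding verification step:
# a drafted token x is accepted with probability min(1, p_target(x) / p_draft(x));
# on rejection, a replacement token is sampled from the normalized residual
# distribution max(0, p_target - p_draft). Names are illustrative.
import torch

def verify_drafted_token(p_target: torch.Tensor,
                         p_draft: torch.Tensor,
                         drafted_token: int) -> int:
    """p_target, p_draft: probability vectors over the vocabulary at one position."""
    accept_prob = torch.clamp(p_target[drafted_token] / p_draft[drafted_token], max=1.0)
    if torch.rand(()) < accept_prob:
        return drafted_token  # accepted; counts toward the token acceptance rate
    # Rejected: resample from the residual distribution.
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, 1))
```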
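
As a rough illustration of the selective distillation idea described above, the following sketch uses a reference model's per-token loss to filter out difficult-to-fit tokens and computes the KL distillation loss only on the remaining, easier tokens. The difficulty metric, the keep_ratio, the KL direction, and all names are assumptions made for illustration, not the authors' exact procedure.

```python
# Illustrative sketch of selective knowledge distillation in the spirit of
# AdaSPEC: a reference model scores per-token difficulty, the hardest tokens
# are filtered out, and the draft model is distilled toward the target model
# on the kept tokens only. Details here are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def selective_kd_loss(draft_logits, target_logits, ref_logits, labels,
                      keep_ratio=0.8, temperature=1.0):
    """draft_logits, target_logits, ref_logits: [batch, seq_len, vocab];
    labels: [batch, seq_len] ground-truth token ids used to score difficulty."""
    # Per-token difficulty: the reference model's cross-entropy on the label.
    ref_log_probs = F.log_softmax(ref_logits, dim=-1)
    difficulty = -ref_log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # [B, T]

    # Keep the easiest keep_ratio fraction of tokens in each sequence.
    k = max(1, int(keep_ratio * difficulty.size(1)))
    keep_idx = difficulty.topk(k, dim=1, largest=False).indices
    mask = torch.zeros_like(difficulty, dtype=torch.bool).scatter_(1, keep_idx, True)

    # Token-level KL(target || draft), averaged over the kept tokens only.
    t = temperature
    target_log_probs = F.log_softmax(target_logits / t, dim=-1)
    draft_log_probs = F.log_softmax(draft_logits / t, dim=-1)
    kl = (target_log_probs.exp() * (target_log_probs - draft_log_probs)).sum(-1)
    return (kl * mask).sum() / mask.sum()
```

In this sketch the filtering criterion is a simple top-k over reference-model losses per sequence; the paper's reference model and selection rule may differ, but the overall shape of the objective, distilling only on tokens the draft model can plausibly fit, matches the description in the abstract.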