Draft-OPD: On-Policy Distillation for Speculative Draft Models

May 28, 2026
Authors: Haodi Lei, Yafy Li, Haoran Zhang, Shunkai Zhang, Qianjia Cheng, Xiaoye Qu, Ganqu Cui, Bowen Zhou, Ning Ding, Yun Luo, Yu Cheng
cs.AI

Abstract

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, such as EAGLE-3 or DFlash, is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models: they cannot reliably roll out complete sequences on their own, while target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the verification-exposed error positions. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over 5× lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%, respectively.
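
To make the terms "acceptance length" and "verification-exposed error position" concrete, the following is a minimal sketch of greedy speculative decoding with toy next-token functions. It is not the authors' implementation and does not show Draft-OPD training. The names `verify_block`, `speculative_decode`, `toy_target_next`, and `toy_draft_next` are hypothetical. Counting the target's correction or bonus token toward the acceptance length is an assumption, and verification is written as a sequential loop for clarity where a real system would use one parallel forward pass of the target.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical interface: a greedy next-token function over a token context.
NextToken = Callable[[List[Token]], Token]


def verify_block(prefix: List[Token], block: List[Token],
                 target_next: NextToken) -> Tuple[int, Token]:
    """Return (number of accepted draft tokens, target token that follows them).

    The loop stands in for a single parallel target pass over the block.
    """
    context = list(prefix)
    for i, tok in enumerate(block):
        expected = target_next(context)
        if tok != expected:
            return i, expected          # first rejection: target's correction
        context.append(tok)
    return len(block), target_next(context)  # all accepted: bonus token


def speculative_decode(prompt: List[Token], draft_next: NextToken,
                       target_next: NextToken, block_size: int,
                       max_new_tokens: int):
    tokens = list(prompt)
    accepted_per_step: List[int] = []   # tokens gained per verification step
    error_positions: List[int] = []     # sequence indices where the draft was rejected
    while len(tokens) - len(prompt) < max_new_tokens:
        # The drafter proposes a block under its own policy.
        context, block = list(tokens), []
        for _ in range(block_size):
            tok = draft_next(context)
            block.append(tok)
            context.append(tok)
        n_acc, next_tok = verify_block(tokens, block, target_next)
        if n_acc < block_size:
            error_positions.append(len(tokens) + n_acc)
        tokens.extend(block[:n_acc])
        tokens.append(next_tok)
        accepted_per_step.append(n_acc + 1)
    return tokens[:len(prompt) + max_new_tokens], accepted_per_step, error_positions


def toy_target_next(context: List[Token]) -> Token:
    # Toy target: counts upward modulo 10.
    return (context[-1] + 1) % 10


def toy_draft_next(context: List[Token]) -> Token:
    # Toy drafter: agrees with the target except after token 2.
    return 0 if context[-1] == 2 else (context[-1] + 1) % 10
```

In this toy setting the decoded output matches what the target alone would produce, which is the sense in which the speedup is lossless. The recorded error positions are the draft-induced states that, per the abstract, Draft-OPD replays drafting from.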