Draft-OPD: On-Policy Distillation for Speculative Draft Models

Abstract

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, as in EAGLE-3 or DFlash, is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models: they cannot reliably roll out complete sequences on their own, while target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the error positions exposed by verification. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over 5× lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%, respectively.
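To make the notions of acceptance and of an error position exposed by verification concrete, the following is a minimal sketch of greedy verification for a single drafted block. It is a toy illustration under stated assumptions, not the Draft-OPD implementation: the function name `verify_block`, the integer token IDs, and the use of exact-match (greedy) acceptance are all hypothetical choices made for this example, and the target model's outputs are passed in as a plain list rather than computed.

```python
from typing import List, Tuple


def verify_block(draft_tokens: List[int], target_tokens: List[int]) -> Tuple[int, int]:
    """Greedy verification of one drafted block (illustrative only).

    target_tokens[i] is assumed to be the token the target model would emit
    at position i given the prefix plus draft_tokens[:i]. It has one more
    entry than draft_tokens: the extra token follows a fully accepted block.

    Returns (number of accepted draft tokens, token the target emits next).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")

    accepted = 0
    for drafted, expected in zip(draft_tokens, target_tokens):
        if drafted != expected:
            break  # first mismatch: everything from here on is rejected
        accepted += 1

    # target_tokens[accepted] is conditioned only on accepted tokens, so it is
    # always valid; later target entries depend on the rejected token and are
    # discarded.
    return accepted, target_tokens[accepted]


# Hypothetical token IDs: the draft diverges from the target at index 2.
accepted, next_token = verify_block([5, 9, 2, 7], [5, 9, 4, 1, 3])
print(accepted, next_token)  # prints "2 4"
```

In this toy run, index 2 is the first position where the drafter's own proposal disagrees with the target. Positions of this kind are draft-induced states that never appear in fixed target-generated trajectories, which is the mismatch the abstract describes. Conventions differ on whether the target's own next token is counted in the reported acceptance length; the sketch returns only the number of accepted draft tokens.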