
TAPS: Task Aware Proposal Distributions for Speculative Sampling

March 27, 2026
作者: Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem
cs.AI

Abstract

Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.
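The two mechanisms the abstract leans on — the acceptance-length metric from speculative sampling and the confidence-vs-entropy routing signals — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the toy distributions, and the `route` helper are all assumptions made for exposition.

```python
import math
import random

def acceptance_length(draft_tokens, q_dists, p_dists, rng=random.random):
    """Standard speculative-sampling verification: accept draft token t
    with probability min(1, p(t)/q(t)), where q is the draft model's
    distribution and p is the target model's; stop at the first rejection.
    Returns the number of accepted draft tokens (the acceptance length)."""
    accepted = 0
    for t, q, p in zip(draft_tokens, q_dists, p_dists):
        if rng() < min(1.0, p[t] / q[t]):
            accepted += 1
        else:
            break
    return accepted

def confidence(dist):
    """Confidence signal: the draft's top-1 next-token probability."""
    return max(dist)

def entropy(dist):
    """Entropy signal: Shannon entropy of the draft's distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def route_by_confidence(drafter_dists):
    """Hypothetical routing step: pick the specialized drafter whose
    next-token distribution has the highest top-1 probability."""
    return max(range(len(drafter_dists)),
               key=lambda i: confidence(drafter_dists[i]))
```

A peaked distribution like `[0.9, 0.1]` has high confidence and low entropy, while a flat one like `[0.5, 0.5]` has low confidence and high entropy, which is why rejected tokens tending toward higher entropy is consistent with confidence being the sharper routing signal.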