TAPS: Task Aware Proposal Distributions for Speculative Sampling
March 27, 2026
Authors: Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem
cs.AI
Abstract
Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts, and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.
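Two of the mechanisms the abstract refers to can be sketched concretely: the acceptance-length metric (how many leading draft tokens the target model accepts) and confidence-based routing among specialized drafters. The sketch below is a minimal toy illustration under simplified assumptions, not the paper's implementation: it uses greedy token matching in place of true speculative sampling, and the drafters `math_drafter` and `chat_drafter` with their keyword-triggered distributions are hypothetical stand-ins for trained draft models.

```python
def acceptance_length(draft_tokens, target_tokens):
    """Acceptance length: number of leading draft tokens that agree with the
    target model's tokens (a greedy-matching simplification of verification)."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n

def top_confidence(dist):
    """Confidence signal: probability of the most likely proposed token."""
    return max(dist.values())

def route_by_confidence(drafters, prompt):
    """Pick the drafter whose proposal distribution is most confident,
    mirroring the confidence-based routing described in the abstract."""
    scores = {name: top_confidence(draft(prompt)) for name, draft in drafters.items()}
    return max(scores, key=scores.get)

# Hypothetical toy drafters returning fixed next-token distributions.
def math_drafter(prompt):
    return {"4": 0.9, "5": 0.1} if "2+2" in prompt else {"ok": 0.4, "sure": 0.6}

def chat_drafter(prompt):
    return {"hello": 0.8, "hi": 0.2} if "greet" in prompt else {"4": 0.5, "5": 0.5}

drafters = {"math": math_drafter, "chat": chat_drafter}
print(route_by_confidence(drafters, "what is 2+2?"))    # -> math
print(route_by_confidence(drafters, "greet the user"))  # -> chat
print(acceptance_length(["2", "+", "2"], ["2", "+", "3"]))  # -> 2
```

In this toy setting the math-style prompt routes to the math drafter because its top-token probability (0.9) exceeds the chat drafter's (0.5), and vice versa for the conversational prompt; the paper's finding is that this confidence signal separates workloads more cleanly than entropy does.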