TAPS: Task Aware Proposal Distributions for Speculative Sampling

Abstract

Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, and evaluate them on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on the reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate at every decoding temperature. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts, and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on draft architecture but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.
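To make the routing idea concrete, the following is a minimal sketch of confidence-based routing between two specialized drafters. It is an illustration under stated assumptions, not the procedure used in the paper: the abstract does not define "confidence", so it is taken here to be the maximum next-token probability averaged over the drafted positions. The function names and the toy distributions are hypothetical.

```python
import math

def confidence(probs):
    """Assumed definition: the largest probability in a next-token distribution."""
    return max(probs)

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_confidence(draft_dists):
    """Average confidence over the positions a drafter proposed."""
    return sum(confidence(d) for d in draft_dists) / len(draft_dists)

def route(drafts):
    """Pick the drafter whose proposals have the highest mean confidence.

    `drafts` maps a hypothetical drafter name to a list of per-position
    next-token distributions (each a list of probabilities).
    """
    return max(drafts, key=lambda name: mean_confidence(drafts[name]))

# Toy distributions (made up for illustration): the "math" drafter is
# sharply peaked on this prompt, the "chat" drafter is much flatter.
math_draft = [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10]]
chat_draft = [[0.40, 0.30, 0.30], [0.50, 0.25, 0.25]]

chosen = route({"math": math_draft, "chat": chat_draft})
```

In this toy case the router selects the "math" drafter because its proposals are more peaked. An entropy-based router would instead compare `entropy` values and prefer the lower one. The abstract reports that confidence gives clearer benchmark-level routing decisions than entropy, but this sketch does not reproduce that comparison.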
