Draft-OPD: On-Policy Distillation for Speculative Draft Models

Abstract

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models such as EAGLE-3 or DFlash is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), in which the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models: they cannot reliably roll out complete sequences on their own, while target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout to obtain stable continuations and replays drafting from the error positions exposed by verification. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over 5× lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%, respectively.
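
As a rough illustration of the mechanism the abstract describes, the toy sketch below runs a greedy draft-then-verify loop and records the positions where verification rejects a draft token. It is not the Draft-OPD training procedure. Everything in it is an assumption made for illustration: the names `target_next`, `draft_next`, and `speculative_decode` are hypothetical, both "models" are small deterministic functions, verification uses exact greedy token matching rather than rejection sampling, and acceptance length is counted as the number of draft tokens accepted per verification step (conventions differ on whether the extra target token is included).

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]
# (prefix at the rejected position, rejected draft token, target's token)
ErrorState = Tuple[List[Token], Token, Token]


def target_next(prefix: List[Token]) -> Token:
    """Toy stand-in for the target model's greedy next token (hypothetical)."""
    return (prefix[-1] * 3 + len(prefix)) % 7


def draft_next(prefix: List[Token]) -> Token:
    """Toy drafter: matches the target except at every fourth prefix length."""
    tok = target_next(prefix)
    return tok if len(prefix) % 4 else (tok + 1) % 7


def speculative_decode(
    prompt: List[Token],
    target: NextToken,
    draft: NextToken,
    block_size: int,
    max_new_tokens: int,
) -> Tuple[List[Token], List[int], List[ErrorState]]:
    seq = list(prompt)
    accept_lengths: List[int] = []
    error_states: List[ErrorState] = []

    while len(seq) - len(prompt) < max_new_tokens:
        # 1) The drafter proposes a block under its own policy.
        block: List[Token] = []
        for _ in range(block_size):
            block.append(draft(seq + block))

        # 2) The target verifies the block. Real systems score all positions
        #    in one parallel forward pass; this loop is sequential for clarity.
        n_acc = 0
        while n_acc < block_size and block[n_acc] == target(seq + block[:n_acc]):
            n_acc += 1

        # 3) Keep the accepted prefix, then let the target supply one token
        #    (the correction on a rejection, an extra token otherwise).
        seq.extend(block[:n_acc])
        target_tok = target(seq)
        if n_acc < block_size:
            # Verification exposed a drafting error at this state.
            error_states.append((list(seq), block[n_acc], target_tok))
        seq.append(target_tok)
        accept_lengths.append(n_acc)

    return seq[: len(prompt) + max_new_tokens], accept_lengths, error_states


if __name__ == "__main__":
    out, lengths, errors = speculative_decode([1, 2], target_next, draft_next, 4, 12)
    print("mean acceptance length:", sum(lengths) / len(lengths))
    print("error positions:", [len(prefix) for prefix, _, _ in errors])
```

In this sketch the recorded prefixes follow the target's output, because every rejected token is replaced by the target's choice, while the rejected token itself comes from the drafter's own policy. States of this kind are where a replay step could restart drafting and collect target feedback.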