Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Abstract

Speculative decoding accelerates large language model (LLM) inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
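
To make the draft-then-verify loop and the two-stage drafting idea concrete, the following is a minimal toy sketch in Python. It is not the Domino implementation: the abstract does not specify the head architecture or the acceptance rule, so the sketch assumes greedy verification and a simple additive correction. The names `speculative_step`, `refine_block`, `head_bias`, and `base_scores` are hypothetical and introduced only for illustration.

```python
from typing import Callable, Dict, List, Sequence

Token = int
# Greedy next-token function standing in for the target model (assumption).
NextToken = Callable[[Sequence[Token]], Token]


def speculative_step(prefix: Sequence[Token],
                     draft_block: Sequence[Token],
                     target_next: NextToken) -> List[Token]:
    """Verify a drafted block against a greedy target model.

    Returns the tokens to append: the longest accepted draft prefix, followed
    by one token from the target (a correction at the first mismatch, or a
    bonus token if every draft token was accepted).
    A real system scores all positions in one parallel target forward pass;
    this toy calls `target_next` position by position for clarity.
    """
    accepted: List[Token] = []
    context = list(prefix)
    for tok in draft_block:
        expected = target_next(context)
        if tok != expected:
            accepted.append(expected)  # target's token replaces the mismatch
            return accepted
        accepted.append(tok)
        context.append(tok)
    accepted.append(target_next(context))  # all drafts accepted: bonus token
    return accepted


def refine_block(base_scores: Sequence[Dict[Token, float]],
                 head_bias: Callable[[Token, Token], float],
                 last_token: Token) -> List[Token]:
    """Two-stage drafting sketch (hypothetical, not the paper's head).

    `base_scores` holds one token->score table per block position, as if
    produced by a single parallel backbone pass. `head_bias(prev, tok)` is a
    cheap prefix-dependent correction added before picking each token, so
    only this light step runs left to right.
    """
    block: List[Token] = []
    prev = last_token
    for scores in base_scores:
        best = max(scores, key=lambda t: scores[t] + head_bias(prev, t))
        block.append(best)
        prev = best
    return block
```

In a toy setting where the target always emits the previous token plus one (mod 10), a parallel-only draft such as `[4, 6, 9]` after the prefix `[1, 2, 3]` is cut off at its second token, whereas a draft refined with a prefix-aware `head_bias` can be accepted in full. This mirrors the motivation stated above: the expensive backbone runs once per block, and only a light correction depends on earlier draft tokens.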