Domino: Decoupling Causal Modeling from Autoregressive Draft Generation in Speculative Decoding

Abstract

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
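The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration of generic greedy speculative decoding, not Domino's implementation: `target_next` and `draft_next` are hypothetical toy stand-ins for the target model and a drafter, and verification is simulated with a Python loop where a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]  # greedy next-token function


def target_next(seq: List[Token]) -> Token:
    """Toy stand-in for the target model (assumption, not a real LLM)."""
    return (seq[-1] * 3 + len(seq)) % 11


def draft_next(seq: List[Token]) -> Token:
    """Toy stand-in for a drafter: wrong whenever the context length is a multiple of 4."""
    guess = target_next(seq)
    return (guess + 1) % 11 if len(seq) % 4 == 0 else guess


def greedy_decode(prompt: List[Token], n_new: int, target: NextFn) -> List[Token]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(
    prompt: List[Token], n_new: int, block_size: int, draft: NextFn, target: NextFn
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns the tokens and the number of verification passes."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1) Draft a block of tokens. This toy drafter runs sequentially;
        #    a parallel drafter would emit the whole block at once.
        block: List[Token] = []
        for _ in range(block_size):
            block.append(draft(seq + block))

        # 2) Verify the block against the target (counted as one pass).
        passes += 1
        accepted: List[Token] = []
        for i, tok in enumerate(block):
            expected = target(seq + block[:i])
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch, drop the rest
                break
        else:
            accepted.append(target(seq + block))  # whole block accepted: bonus token
        seq.extend(accepted)
    return seq[:goal], passes


if __name__ == "__main__":
    out, passes = speculative_decode([1, 2], 20, 4, draft_next, target_next)
    print(out == greedy_decode([1, 2], 20, target_next), passes)
```

Under greedy verification the output matches target-only decoding token for token, so the drafter affects only how many tokens each verification pass yields. That is the trade-off the abstract refers to: a better drafter lengthens the accepted prefix, while a cheaper one shortens the time spent producing each block.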