Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Abstract

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. We propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
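
The draft-then-verify step that speculative decoding relies on can be illustrated with a minimal sketch. It assumes greedy decoding, where a draft token is accepted only if it equals the target model's own greedy choice at that position. The function name `verify_greedy` and the argument `target_next` are hypothetical names chosen for illustration, and this is a common simplification rather than the verification rule Domino itself uses.

```python
from typing import List, Sequence


def verify_greedy(draft: Sequence[int], target_next: Sequence[int]) -> List[int]:
    """Return the tokens committed after one draft-and-verify step.

    Assumption: target_next[i] is the target model's greedy token given the
    current prefix followed by draft[:i]. A single parallel forward pass of
    the target over the drafted block yields all len(draft) + 1 of these.
    """
    if len(target_next) != len(draft) + 1:
        raise ValueError("target_next must have exactly len(draft) + 1 entries")

    accepted: List[int] = []
    for i, token in enumerate(draft):
        # Stop at the first position where the draft disagrees with the target.
        if token != target_next[i]:
            break
        accepted.append(token)

    # The target's own token at the first unverified position is always
    # committed, so every step yields at least one new token.
    accepted.append(target_next[len(accepted)])
    return accepted


if __name__ == "__main__":
    # The draft diverges at its third token: two tokens are accepted and the
    # target's correction is appended.
    print(verify_greedy([5, 7, 9], [5, 7, 2, 4]))
    # The whole draft is accepted and one bonus token is appended.
    print(verify_greedy([5, 7, 9], [5, 7, 9, 4]))
```

Under this simplification, the number of tokens committed per target forward pass grows with how long the draft stays in agreement with the target, which is why draft quality and drafting cost trade off against each other.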