Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Abstract

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to \(5.49\times\) end-to-end speedup under the Transformers backend and up to \(5.8\times\) throughput speedup under SGLang serving.
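As a rough illustration of the two ideas above (a parallel draft that is then refined with prefix-dependent information, followed by verification against the target model), the toy sketch below may help. It is not the Domino implementation: the abstract does not specify the head's architecture, so the additive `correction` function, the `no_repeat` rule, and all names and numbers here are hypothetical, and verification is shown only in its standard greedy form.

```python
from typing import Callable, List, Sequence

Scores = Sequence[float]


def refine_draft(
    base_logits: Sequence[Scores],
    correction: Callable[[int, int], Scores],
    prefix_token: int,
) -> List[int]:
    """Two-stage drafting sketch.

    base_logits[i] stands in for the parallel backbone's scores at block
    position i (all positions produced in one pass). correction(prev, i)
    is a hypothetical lightweight head: a cheap additive adjustment that
    depends on the previously drafted token.
    """
    draft: List[int] = []
    prev = prefix_token
    for i, logits in enumerate(base_logits):
        adjusted = [l + a for l, a in zip(logits, correction(prev, i))]
        token = max(range(len(adjusted)), key=adjusted.__getitem__)
        draft.append(token)
        prev = token  # only this cheap step is sequential
    return draft


def accept_greedy(draft: Sequence[int], target_greedy: Sequence[int]) -> List[int]:
    """Greedy verification.

    target_greedy[i] is the target model's argmax given the context plus
    draft[:i], obtained in a single parallel pass, so it has
    len(draft) + 1 entries. The longest matching prefix is accepted, then
    the target's own token at the first mismatch (or the bonus token if
    everything matched) is appended.
    """
    n = 0
    while n < len(draft) and draft[n] == target_greedy[n]:
        n += 1
    return list(draft[:n]) + [target_greedy[n]]


# Toy data over a 3-token vocabulary (illustrative values only).
base_logits = [[0.1, 2.0, 0.3], [0.9, 1.0, 0.2], [0.5, 0.4, 0.6]]


def no_repeat(prev: int, position: int) -> Scores:
    # Hypothetical causal rule: penalise repeating the previous token.
    return [-1.0 if v == prev else 0.0 for v in range(3)]


refined = refine_draft(base_logits, no_repeat, prefix_token=0)
parallel_only = [max(range(3), key=row.__getitem__) for row in base_logits]
target_greedy = [1, 0, 1, 2]

print(refined, accept_greedy(refined, target_greedy))
print(parallel_only, accept_greedy(parallel_only, target_greedy))
```

In this toy setting the position-independent draft repeats a token the target rejects, while the prefix-aware correction avoids it and gets one more token accepted. The expensive scores are computed once for the whole block, and only the small correction runs token by token.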