JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

Abstract

Speculative decoding (SD) accelerates autoregressive large language models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates, which suit tree speculative decoding and yield longer acceptance lengths, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form trees whose tokens are individually plausible yet mutually inconsistent, wasting budget and reducing acceptance. We propose JetSpec, a head-based SD framework that combines the efficiency of drafting in a single forward pass with branch-wise causal conditioning. JetSpec trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetSpec to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetSpec consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetSpec achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetSpec.
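
The scaling limitation described above can be illustrated with a toy cost model. The sketch below is not JetSpec's method or its evaluation setup. It assumes a single draft chain rather than a tree, an independent per-token acceptance probability `alpha`, a verification cost that does not grow with depth, and costs measured in units of one target forward pass. The names `expected_tokens` and `speedup` and all parameter values are hypothetical.

```python
def expected_tokens(alpha: float, depth: int) -> float:
    """Expected tokens emitted per verification step for one draft chain.

    Assumes each drafted token is accepted independently with probability
    `alpha`, and that the target model always contributes one extra token
    (the correction or bonus token) after the accepted prefix.
    """
    if alpha >= 1.0:
        return float(depth + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^depth
    return (1.0 - alpha ** (depth + 1)) / (1.0 - alpha)


def speedup(alpha: float, depth: int, draft_cost: float,
            sequential: bool = True, verify_cost: float = 1.0) -> float:
    """Toy speedup over plain decoding (one token per target forward).

    sequential=True  models a drafter that needs one pass per draft position.
    sequential=False models a drafter that fills all positions in one pass.
    Costs are in units of one target forward pass (an assumption).
    """
    passes = depth if sequential else min(depth, 1)
    step_cost = verify_cost + passes * draft_cost
    return expected_tokens(alpha, depth) / step_cost


if __name__ == "__main__":
    # Hypothetical values chosen only to show the shape of the curves.
    for depth in (2, 4, 8, 16, 32):
        seq = speedup(0.8, depth, draft_cost=0.05, sequential=True)
        one = speedup(0.8, depth, draft_cost=0.05, sequential=False)
        print(f"depth={depth:2d}  sequential={seq:.2f}x  one-pass={one:.2f}x")
```

In this toy model, the sequential drafter's speedup peaks and then declines as depth grows, because drafting cost keeps rising while expected accepted tokens saturate near 1 / (1 - alpha). The one-pass drafter avoids the rising cost but is still capped by that saturation, which is why acceptance must also stay high for a larger budget to pay off.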