JetSpec: Breaking the Scaling Ceiling of Speculative Decoding via Parallel Tree Drafting

Abstract

Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetSpec, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetSpec trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetSpec to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetSpec consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetSpec achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetSpec.
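The trade-off described in the abstract (a larger draft budget helps only while acceptance stays high and drafting stays cheap) can be illustrated with a toy cost model. This is a minimal sketch under stated assumptions, not the analysis used in JetSpec: it models a single draft chain rather than a tree, assumes each drafted token is accepted independently with probability `alpha`, treats one target-model verification pass as unit cost regardless of draft length, and uses a hypothetical `draft_cost` parameter for the relative cost of one draft-head forward pass. The function names are illustrative only.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step for a chain of k drafts.

    Assumes i.i.d. per-token acceptance probability alpha; the extra +1 token
    comes from the target model's own prediction at the first rejection
    (or after the last accepted draft).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, draft_cost: float, one_pass: bool) -> float:
    """Toy speedup over plain autoregressive decoding.

    one_pass=False: the drafter runs k sequential forwards (cost k * draft_cost).
    one_pass=True:  the drafter emits all k positions in one forward (cost draft_cost).
    Verification is assumed to cost 1 unit in both cases.
    """
    drafting = draft_cost if one_pass else k * draft_cost
    return expected_tokens(alpha, k) / (1.0 + drafting)


if __name__ == "__main__":
    for alpha in (0.5, 0.8, 0.95):
        for k in (2, 4, 8, 16):
            ar = speedup(alpha, k, draft_cost=0.2, one_pass=False)
            op = speedup(alpha, k, draft_cost=0.2, one_pass=True)
            print(f"alpha={alpha:.2f} k={k:2d}  sequential={ar:.2f}x  one-pass={op:.2f}x")
```

In this toy model, a sequential drafter with low acceptance gets slower as `k` grows, while a one-pass drafter keeps its drafting cost flat. The model does not capture the acceptance loss that the abstract attributes to branch-agnostic one-pass drafters; that effect would appear here only as a lower `alpha`.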