Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference
October 15, 2025
Authors: Nikhil Bhendawade, Kumari Nishu, Arnav Kundu, Chris Bartels, Minsik Cho, Irina Belousova
cs.AI
Abstract
Speculative decoding accelerates LLM inference by using a draft model to look
ahead, but gains are capped by the cost of autoregressive draft generation:
increasing draft size raises acceptance rates but introduces additional
latency overhead, exacerbating the speed-accuracy tradeoff. Prior methods
(Medusa, Hydra, EAGLE) partially reduce draft cost but either degrade
acceptance or introduce overheads that limit scaling. We present Mirror
Speculative Decoding (Mirror-SD), an inference algorithm that breaks the
latency-acceptance tradeoff. Mirror-SD launches branch-complete rollouts from
early-exit signals in parallel with the target model's suffix and explicitly
maps computation across heterogeneous accelerators (GPU and NPU) to exploit
cross-device parallelism. The draft speculates forward continuations for the
target to verify, while the target simultaneously speculates correction paths
for the draft, converting speculation into two complementary execution
pipelines. To further cut draft latency without weakening acceptance semantics,
we add speculative streaming so the draft emits multiple tokens per step. This
dual strategy of parallel heterogeneous execution plus multi-token speculative
streaming pushes speculative decoding toward its ideal regime of high
acceptance with low overhead. On SpecBench with server-scale models from 14B to
66B parameters, Mirror-SD delivers consistent end-to-end gains, achieving
2.8x-5.8x wall-time speedups across diverse tasks and a 30% average relative
improvement over the strongest baseline, EAGLE3.
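For readers new to the setting, the sketch below shows the generic draft-and-verify loop that speculative decoding builds on, with the draft emitting K tokens per step in the spirit of speculative streaming. It is a toy, greedy-decoding illustration only: `draft_step`, `target_forward`, and `K` are hypothetical stand-ins, verification is shown token-by-token rather than as one batched target forward pass, and none of Mirror-SD's specific machinery (early-exit branch launches, GPU/NPU pipeline mapping, target-side correction speculation) is modeled here.

```python
# Minimal, illustrative draft-and-verify loop in the spirit of speculative
# decoding. All names (draft_step, target_forward, K) are hypothetical
# stand-ins, not the paper's implementation; Mirror-SD additionally runs the
# draft and target as parallel pipelines on separate accelerators (GPU/NPU).
import random

VOCAB = list(range(100))
K = 4  # draft tokens proposed per step (multi-token "streaming" flavor)

def draft_step(context):
    # Toy draft model: emits K tokens at once, standing in for a cheap
    # draft that speculates a multi-token continuation per forward pass.
    random.seed(hash(tuple(context)) % (2**32))
    return [random.choice(VOCAB) for _ in range(K)]

def target_forward(context):
    # Toy target model: a deterministic next-token function standing in
    # for the large model's greedy (argmax) decode.
    random.seed(hash(tuple(context)) % (2**31))
    return random.choice(VOCAB)

def speculative_decode(prompt, max_new=32):
    ctx = list(prompt)
    while len(ctx) - len(prompt) < max_new:
        proposal = draft_step(ctx)
        accepted = []
        for tok in proposal:
            # Target verifies each drafted token; real systems score all
            # K positions in a single batched target forward pass.
            expected = target_forward(ctx + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # On the first mismatch, the target's own token is emitted
                # as the correction and the rest of the draft is discarded.
                accepted.append(expected)
                break
        ctx.extend(accepted)
    return ctx[len(prompt):len(prompt) + max_new]

print(speculative_decode([1, 2, 3], max_new=12))
```

Even in this toy form, the tension the abstract names is visible: a larger K raises the chance of accepting a long run per verification, but each drafted token costs a serial draft step, and that serial draft latency is precisely what Mirror-SD offloads onto a parallel accelerator.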