Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference
October 15, 2025
Authors: Nikhil Bhendawade, Kumari Nishu, Arnav Kundu, Chris Bartels, Minsik Cho, Irina Belousova
cs.AI
Abstract
Speculative decoding accelerates LLM inference by using a draft model to look
ahead, but gains are capped by the cost of autoregressive draft generation:
increasing draft size raises acceptance rates but introduces additional
latency overhead, exacerbating the speed-accuracy tradeoff. Prior methods
(Medusa, Hydra, EAGLE) partially reduce draft cost but either degrade
acceptance or introduce overheads that limit scaling. We present Mirror
Speculative Decoding (Mirror-SD), an inference algorithm that breaks the
latency-acceptance tradeoff. Mirror-SD launches branch-complete rollouts from
early-exit signals in parallel with the target model's suffix and explicitly
maps computation across heterogeneous accelerators (GPU and NPU) to exploit
cross-device parallelism. The draft speculates forward continuations for the
target to verify, while the target simultaneously speculates correction paths
for the draft, converting speculation into two complementary execution
pipelines. To further cut draft latency without weakening acceptance semantics,
we add speculative streaming so the draft emits multiple tokens per step. This
dual strategy of parallel heterogeneous execution plus multi-token speculative
streaming pushes speculative decoding toward its ideal regime of high
acceptance with low overhead. On SpecBench with server-scale models from 14B to
66B parameters, Mirror-SD delivers consistent end-to-end gains, achieving
2.8x-5.8x wall-time speedups across diverse tasks and a 30% average relative
improvement over the strongest baseline, EAGLE3.
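To make the mechanism concrete, below is a minimal sketch of the classic draft-then-verify loop that speculative decoding (and hence Mirror-SD) builds on, using toy stand-in models. The names `draft_next` and `target_next`, the vocabulary, and the 70% draft-target agreement rate are illustrative assumptions, not the paper's implementation; comments mark the two costs Mirror-SD attacks, the serial draft loop (hidden via parallel heterogeneous execution) and one-token-per-step drafting (replaced by speculative streaming).

```python
# Toy draft-then-verify loop for greedy speculative decoding.
# All models and constants here are hypothetical stand-ins.
import random

VOCAB = list(range(100))

def target_next(ctx):
    # Deterministic toy "target model": stand-in for the large LLM.
    rng = random.Random(hash(tuple(ctx)))
    return rng.choice(VOCAB)

def draft_next(ctx):
    # Cheap toy "draft model" that agrees with the target 70% of the
    # time, so some drafted tokens get accepted. Purely illustrative.
    rng = random.Random(hash(tuple(ctx)) ^ 1)
    if rng.random() < 0.7:
        return target_next(ctx)
    return rng.choice(VOCAB)

def speculative_step(ctx, k=4):
    # 1) Draft k tokens autoregressively. This serial loop is the
    #    overhead Mirror-SD targets: it runs the draft on a separate
    #    accelerator in parallel with the target's suffix, and
    #    speculative streaming would emit several tokens per draft
    #    step instead of one.
    drafts, d_ctx = [], list(ctx)
    for _ in range(k):
        t = draft_next(d_ctx)
        drafts.append(t)
        d_ctx.append(t)

    # 2) Verify against the target. A real system scores all k
    #    positions in one batched forward pass; the loop below just
    #    simulates that check position by position.
    accepted, v_ctx = [], list(ctx)
    for t in drafts:
        expected = target_next(v_ctx)
        if t != expected:
            accepted.append(expected)  # target's correction token
            return accepted
        accepted.append(t)
        v_ctx.append(t)
    accepted.append(target_next(v_ctx))  # bonus token: all drafts matched
    return accepted

ctx = [1, 2, 3]
for _ in range(3):
    step = speculative_step(ctx)
    ctx += step
    print(f"accepted {len(step)} token(s) per target pass: {step}")
```

Each call to `speculative_step` costs one target pass but can yield up to k+1 tokens, which is where the wall-time speedup comes from; Mirror-SD's contribution is keeping the acceptance rate high while hiding the draft's own latency.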