DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting
March 2, 2025
Authors: Kai Lv, Honglin Guo, Qipeng Guo, Xipeng Qiu
cs.AI
Abstract
Large language models (LLMs) exhibit exceptional performance across a wide
range of tasks; however, their token-by-token autoregressive generation process
significantly hinders inference speed. Speculative decoding presents a
promising draft-then-verify framework that reduces generation latency while
maintaining output distribution fidelity. Nevertheless, the draft model
introduces additional computational overhead, becoming a performance bottleneck
and increasing the time to first token (TTFT). Previous approaches to mitigate
draft model overhead have primarily relied on heuristics and generally failed
to match the quality of the draft language models. To address these challenges,
we propose DuoDecoding, a novel approach that strategically deploys the draft
and target models on the CPU and GPU respectively, enabling parallel decoding
while preserving draft quality. Our method incorporates a hardware-aware
optimal draft budget to minimize idle times and employs dynamic multi-sequence
drafting to enhance draft quality. Extensive experiments across seven tasks
show that DuoDecoding achieves up to 2.61x speedup in generation latency, while
reducing TTFT to 83% of that in conventional speculative decoding. The code is
available at https://github.com/KaiLv69/DuoDecoding.