DFlash: Block Diffusion for Flash Speculative Decoding
February 5, 2026
Authors: Jian Chen, Yesheng Liang, Zhijian Liu
cs.AI
Abstract
Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
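To make the draft-then-verify loop concrete, below is a minimal sketch of greedy speculative decoding in Python. Everything here is illustrative: `draft_block` and `verify_greedy` are toy stand-ins invented for this sketch, not DFlash's code. In DFlash, the drafter would be the lightweight block diffusion model that emits the whole block in a single forward pass, conditioned on context features extracted from the target model, and verification would be a real parallel forward pass of the target LLM.

```python
import random

VOCAB = 100  # toy vocabulary size for the simulation

def draft_block(prefix, block_size):
    """Hypothetical drafter: proposes `block_size` tokens at once.
    (In DFlash this role is played by a lightweight block diffusion
    model producing the whole block in one forward pass.)"""
    rng = random.Random(hash(tuple(prefix)))
    return [rng.randrange(VOCAB) for _ in range(block_size)]

def verify_greedy(prefix, draft):
    """Simulated target model: scores prefix + draft in one parallel
    pass and returns its greedy token at each draft position.
    Toy behavior: agrees with the draft ~70% of the time."""
    rng = random.Random(hash(tuple(prefix)) ^ 0x5F5F)
    return [d if rng.random() < 0.7 else rng.randrange(VOCAB) for d in draft]

def speculative_step(prefix, block_size=8):
    """One draft-then-verify round of greedy speculative decoding:
    accept draft tokens up to the first disagreement, where the
    target's own token is taken instead, so the result is identical
    to decoding with the target alone (lossless). For simplicity this
    sketch omits the extra 'bonus' token a real verifier yields when
    the entire draft is accepted."""
    draft = draft_block(prefix, block_size)
    target = verify_greedy(prefix, draft)
    out = list(prefix)
    for d, t in zip(draft, target):
        out.append(t)   # the target's token is always the one kept
        if d != t:      # first mismatch invalidates later draft tokens
            break
    return out

if __name__ == "__main__":
    seq = [1, 2, 3]
    for step in range(4):
        new = speculative_step(seq)
        print(f"step {step}: +{len(new) - len(seq)} tokens in one target pass")
        seq = new
```

Because every token appended comes from the (simulated) target model, the output matches target-only greedy decoding; the speedup comes from committing several tokens per target forward pass whenever the draft is accepted.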