DFlash: Block Diffusion for Flash Speculative Decoding

Abstract

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
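
To make the draft-then-verify idea concrete, the following is a minimal sketch of lossless verification under greedy decoding. It is not DFlash's implementation: `toy_target`, `toy_draft`, `verify_block`, and `speculative_decode` are hypothetical names, the drafter is a toy stand-in rather than a block diffusion model, and the target is queried position by position for readability, whereas a real system would score the whole drafted block in one forward pass.

```python
from typing import Callable, List

Token = int

def verify_block(target_next: Callable[[List[Token]], Token],
                 prefix: List[Token],
                 draft: List[Token]) -> List[Token]:
    """Greedy verification: keep draft tokens until the first mismatch."""
    accepted: List[Token] = []
    context = list(prefix)
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            # Replace the first wrong draft token with the target's choice.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        context.append(tok)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(context))
    return accepted

def speculative_decode(target_next, draft_block, prompt, max_new_tokens, block_size=4):
    """Alternate between drafting a block and verifying it with the target."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)
        out.extend(verify_block(target_next, out, draft))
    return out[:len(prompt) + max_new_tokens]

def greedy_decode(target_next, prompt, max_new_tokens):
    """Plain token-by-token decoding, used as the reference output."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out

# Toy stand-ins (assumptions for illustration only).
def toy_target(context: List[Token]) -> Token:
    return (context[-1] + 1) % 10

def toy_draft(context: List[Token], k: int) -> List[Token]:
    block = [(context[-1] + i + 1) % 10 for i in range(k)]
    block[-1] = (block[-1] + 1) % 10  # deliberately wrong final token
    return block
```

Because every step emits at least one token chosen by the target, the output matches plain greedy decoding exactly; the speedup in practice depends on how many draft tokens are accepted per verification step.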
