DFlash: Block Diffusion for Flash Speculative Decoding

Abstract

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
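The draft-then-verify loop that speculative decoding relies on can be illustrated with a small sketch. This is a generic greedy speculative decoding loop under toy assumptions, not the DFlash implementation: `target_next` and `draft_block` are hypothetical stand-ins for the target LLM and a block drafter, and the verification step is written as a Python loop where a real system would score all drafted positions in one batched target forward pass. The sketch shows why the scheme is lossless under greedy decoding: every emitted token is the token the target itself would have chosen.

```python
from typing import Callable, List, Tuple

Token = int
VOCAB = 50  # toy vocabulary size (assumption for illustration)


def target_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the target LLM's greedy next token."""
    return (sum(context) * 31 + len(context)) % VOCAB


def draft_block(context: List[Token], block_size: int) -> List[Token]:
    """Hypothetical stand-in for a drafter that proposes a whole block.

    It imitates the target but is deliberately wrong at some positions,
    so that some drafted tokens are rejected during verification.
    """
    proposal: List[Token] = []
    ctx = list(context)
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 5 == 0:  # inject a drafting error at these positions
            tok = (tok + 1) % VOCAB
        proposal.append(tok)
        ctx.append(tok)
    return proposal


def greedy_decode(prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def speculative_decode(
    prompt: List[Token],
    max_new_tokens: int,
    block_size: int,
    drafter: Callable[[List[Token], int], List[Token]] = draft_block,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        proposal = drafter(out, block_size)
        rounds += 1  # one (conceptually parallel) target verification pass
        accepted: List[Token] = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace first mismatch, stop
                break
            accepted.append(tok)
        else:
            # Whole block accepted: the target's pass also yields one more token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new_tokens], rounds


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(prompt, 20)
    tokens, rounds = speculative_decode(prompt, 20, block_size=4)
    print("identical to greedy:", tokens == baseline)
    print("verification rounds:", rounds, "vs. 20 sequential target calls")
```

In this toy setting the number of verification rounds drops as more drafted tokens per block are accepted, which is why the abstract emphasizes both drafting cost and acceptance rate.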
