Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Abstract

Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-source Diffusion LLMs often lags behind that of autoregressive models, due to the lack of a Key-Value (KV) Cache and to quality degradation when decoding multiple tokens simultaneously. To bridge this gap, we introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. Additionally, we identify the root cause of generation quality degradation in parallel decoding as the disruption of token dependencies under the conditional independence assumption. To address this, we propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. Experimental results on LLaDA and Dream models across multiple LLM benchmarks demonstrate up to 27.6× throughput improvement with minimal accuracy loss, closing the performance gap with autoregressive models and paving the way for practical deployment of Diffusion LLMs.
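
The confidence-aware selection step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `confident_positions`, the nested-list representation of per-position token probabilities, the default threshold of 0.9, and the fallback that decodes the single most confident position when nothing exceeds the threshold are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def confident_positions(
    probs: Sequence[Sequence[float]],
    masked: Sequence[int],
    threshold: float = 0.9,
) -> Dict[int, int]:
    """Pick which masked positions to decode in one parallel step.

    probs[pos] is a (hypothetical) probability distribution over the
    vocabulary at position pos. Returns {position: token_id}.
    """
    candidates: List[Tuple[float, int, int]] = []
    for pos in masked:
        row = probs[pos]
        # Greedy token and its probability serve as the confidence score.
        token_id = max(range(len(row)), key=row.__getitem__)
        candidates.append((row[token_id], pos, token_id))

    if not candidates:
        return {}

    # Decode only positions whose confidence exceeds the threshold.
    chosen = [c for c in candidates if c[0] > threshold]
    if not chosen:
        # Assumed fallback: decode the single most confident position
        # so that every step makes progress.
        chosen = [max(candidates)]
    return {pos: tok for _, pos, tok in chosen}


# Toy example: three masked positions over a three-token vocabulary.
example_probs = [
    [0.97, 0.02, 0.01],
    [0.40, 0.35, 0.25],
    [0.05, 0.93, 0.02],
]
print(confident_positions(example_probs, masked=[0, 1, 2]))  # {0: 0, 2: 1}
```

In this toy example positions 0 and 2 are decoded together, while the uncertain position 1 stays masked and is revisited in a later step, once the newly decoded tokens are available as context.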
