Accelerating Diffusion LLMs via Adaptive Parallel Decoding

Abstract

The generation speed of LLMs is bottlenecked by autoregressive decoding, where tokens are predicted sequentially, one at a time. Diffusion large language models (dLLMs), by contrast, theoretically allow parallel token generation, but in practice struggle to match the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method puts forward three tunable parameters to flexibly trade off throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradation on downstream benchmarks.
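
The multiplicative mixture mentioned in the abstract can be illustrated with a small toy sketch. Everything below is an assumption made for illustration and is not taken from the abstract: the weighted geometric form of the mixture, the name `weight`, the helper names, and the acceptance rule, which keeps the longest prefix on which a draw from the dLLM marginals and a draw from the mixture agree under shared Gumbel noise. The autoregressive conditionals are passed in as precomputed lists, whereas a real decoder would obtain them by running the auxiliary model on the proposed tokens.

```python
import math
import random


def multiplicative_mixture(p_dllm, p_ar, weight):
    """Normalized product p_dllm[i]**weight * p_ar[i]**(1 - weight).

    The exact mixture form is an assumption for illustration only.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(p_dllm) != len(p_ar):
        raise ValueError("distributions must share a vocabulary size")
    unnormalized = [(a ** weight) * (b ** (1.0 - weight))
                    for a, b in zip(p_dllm, p_ar)]
    total = sum(unnormalized)
    if total <= 0.0:
        raise ValueError("distributions have no common support")
    return [u / total for u in unnormalized]


def gumbel_argmax(probs, noise):
    """Draw a token index via the Gumbel-max trick with the given noise."""
    scores = [math.log(p) + g if p > 0.0 else -math.inf
              for p, g in zip(probs, noise)]
    return max(range(len(probs)), key=scores.__getitem__)


def accepted_prefix_length(dllm_marginals, ar_conditionals, weight, rng):
    """Hypothetical acceptance rule: count leading positions where the
    dLLM draw and the mixture draw coincide under shared noise."""
    accepted = 0
    for p_d, p_a in zip(dllm_marginals, ar_conditionals):
        # One Gumbel(0, 1) variate per vocabulary entry, shared by both draws.
        noise = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in p_d]
        proposed = gumbel_argmax(p_d, noise)
        target = gumbel_argmax(multiplicative_mixture(p_d, p_a, weight), noise)
        if proposed != target:
            break  # stop at the first disagreement
        accepted += 1
    return accepted


if __name__ == "__main__":
    rng = random.Random(0)
    marginals = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
    conditionals = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [0.5, 0.25, 0.25]]
    print(accepted_prefix_length(marginals, conditionals, 0.5, rng))
```

In this toy setup, a `weight` of 1 makes the mixture equal to the dLLM marginals, so every proposed token is accepted, while smaller values give the auxiliary model more say and tend to shorten the accepted prefix. This mirrors, only loosely, the throughput and quality trade-off described in the abstract.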
