
Accelerating Diffusion LLMs via Adaptive Parallel Decoding

May 31, 2025
Authors: Daniel Israel, Guy Van den Broeck, Aditya Grover
cs.AI

Abstract

The generation speed of LLMs is bottlenecked by autoregressive decoding, where tokens are predicted sequentially one by one. Alternatively, diffusion large language models (dLLMs) theoretically allow for parallel token generation, but in practice struggle to achieve the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method puts forward three tunable parameters to flexibly trade off throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradation on downstream benchmarks.
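The abstract describes APD only at a high level; the sketch below illustrates how an adaptive parallel acceptance rule built on a multiplicative mixture might look. Everything in it is an assumption layered on the abstract, not the authors' implementation: `dllm_marginals` and `ar_logprob_next` are hypothetical stand-ins for the diffusion LLM's per-position marginals and the small auxiliary autoregressive model, the mixture form p_dLLM(x)^w · p_AR(x)^(1-w) is one plausible reading of "multiplicative mixture", and the three knobs shown (mixture weight, maximum parallel window, acceptance threshold) are illustrative and not necessarily the paper's three tunable parameters.

```python
# Minimal sketch of an adaptive parallel acceptance rule in the spirit of APD.
# All names and the specific acceptance criterion are illustrative assumptions,
# not the authors' implementation: `dllm_marginals` and `ar_logprob_next` are
# hypothetical stand-ins for the diffusion LLM and the small auxiliary AR model.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size


def dllm_marginals(prefix, k):
    """Stand-in: per-position marginal distributions for the next k masked tokens."""
    logits = rng.normal(size=(k, VOCAB))
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return probs / probs.sum(axis=-1, keepdims=True)


def ar_logprob_next(prefix, token):
    """Stand-in: log-probability of `token` given `prefix` under a small AR model."""
    logits = rng.normal(size=VOCAB)
    return (logits - np.logaddexp.reduce(logits))[token]


def apd_step(prefix, max_parallel=8, mixture_weight=0.5, threshold=-4.0):
    """Draw up to `max_parallel` tokens from the dLLM marginals in parallel, then
    accept the longest prefix whose multiplicative-mixture score
    (w * log p_dLLM + (1 - w) * log p_AR) stays above a threshold."""
    marginals = dllm_marginals(prefix, max_parallel)
    proposal = [int(rng.choice(VOCAB, p=p)) for p in marginals]  # parallel draw

    accepted = []
    for i, tok in enumerate(proposal):
        mix_logp = (mixture_weight * np.log(marginals[i, tok])
                    + (1.0 - mixture_weight) * ar_logprob_next(prefix + accepted, tok))
        if mix_logp < threshold:   # low agreement between the two models: stop here
            break
        accepted.append(tok)
    return accepted or proposal[:1]  # always advance by at least one token


print(apd_step(prefix=[1, 2, 3]))
```

Raising the threshold or the mixture weight in this toy version makes acceptance more conservative (fewer tokens per step, closer to one-by-one decoding), while relaxing them accepts longer parallel runs, which is the throughput-versus-quality dial the abstract alludes to.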