

Accelerating Diffusion LLMs via Adaptive Parallel Decoding

May 31, 2025
作者: Daniel Israel, Guy Van den Broeck, Aditya Grover
cs.AI

Abstract

The generation speed of LLMs is bottlenecked by autoregressive decoding, where tokens are predicted sequentially, one at a time. Diffusion large language models (dLLMs), by contrast, theoretically allow parallel token generation, but in practice struggle to match the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method puts forward three tunable parameters to flexibly trade off throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradation on downstream benchmarks.
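To make the idea concrete, here is a minimal sketch of an APD-style acceptance rule. This is an illustration under stated assumptions, not the paper's exact algorithm: the function name, the geometric form of the multiplicative mixture, and the `weight`/`threshold` parameters are all hypothetical stand-ins for the three tunable parameters the abstract mentions. It takes the dLLM's per-position marginal probabilities for its proposed tokens and a small autoregressive model's conditional probabilities for those same tokens, and accepts a variable-length prefix in parallel.

```python
import numpy as np

def adaptive_parallel_accept(dllm_marginals, ar_conditionals,
                             weight=0.5, threshold=0.1):
    """Sketch of an APD-style acceptance rule (illustrative, not the
    paper's exact method).

    dllm_marginals:  probability the dLLM assigns to each of its proposed
                     tokens, shape (T,)
    ar_conditionals: probability a small auxiliary autoregressive model
                     assigns to the same tokens given the prefix, shape (T,)
    weight:          mixing exponent between the two models (assumed form)
    threshold:       minimum mixture score to keep accepting in parallel
    Returns the number of leading tokens accepted this step.
    """
    # Multiplicative (geometric) mixture of the two models' probabilities.
    mixture = dllm_marginals ** (1.0 - weight) * ar_conditionals ** weight
    accepted = 0
    for score in mixture:
        if score < threshold:  # stop at the first low-confidence position
            break
        accepted += 1
    return max(accepted, 1)    # always commit at least one token per step
```

A higher `threshold` accepts fewer tokens per step (closer to one-by-one autoregressive decoding, higher quality), while a lower one accepts longer parallel runs (higher throughput), mirroring the throughput/quality trade-off the abstract describes.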