Accelerating Diffusion LLMs via Adaptive Parallel Decoding

Abstract

The generation speed of LLMs is bottlenecked by autoregressive decoding, where tokens are predicted sequentially, one by one. Diffusion large language models (dLLMs) theoretically allow for parallel token generation, but in practice they struggle to match the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method puts forward three tunable parameters to flexibly trade off throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradation on downstream benchmarks.
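To make the "multiplicative mixture" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the mixture is a per-position geometric interpolation with a hypothetical weight `r`, and that the number of tokens kept in parallel is the longest prefix on which a draft from the dLLM marginals agrees with a draw from the mixture under shared Gumbel noise. The abstract specifies neither the acceptance rule nor the names of the three tunable parameters, so `r`, `p_diffusion`, `p_ar`, and both function names are illustrative assumptions. The autoregressive conditionals are also taken as precomputed per position, which simplifies away the dependence on previously drafted tokens.

```python
import numpy as np

EPS = 1e-12  # avoids log(0); an implementation detail of this sketch


def multiplicative_mixture(p_diffusion, p_ar, r):
    """Per-position mixture proportional to p_diffusion**r * p_ar**(1 - r).

    p_diffusion, p_ar: arrays of shape (positions, vocab), rows sum to 1.
    r: hypothetical mixture weight in [0, 1]; r = 1 recovers the dLLM marginals.
    """
    log_mix = r * np.log(p_diffusion + EPS) + (1.0 - r) * np.log(p_ar + EPS)
    log_mix -= log_mix.max(axis=-1, keepdims=True)  # numerical stability
    mix = np.exp(log_mix)
    return mix / mix.sum(axis=-1, keepdims=True)


def accepted_prefix_length(p_diffusion, p_mix, rng):
    """Draw from both distributions with shared Gumbel noise and count the
    leading positions where the two draws agree (assumed acceptance rule)."""
    gumbel = rng.gumbel(size=p_diffusion.shape)
    draft = np.argmax(np.log(p_diffusion + EPS) + gumbel, axis=-1)
    target = np.argmax(np.log(p_mix + EPS) + gumbel, axis=-1)
    agree = draft == target
    # argmin on a boolean array gives the first False; handle the all-True case.
    n_accept = len(agree) if agree.all() else int(np.argmin(agree))
    return draft, n_accept


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p_d = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]])
    p_a = np.array([[0.6, 0.3, 0.1], [0.1, 0.1, 0.8], [0.3, 0.5, 0.2]])
    p_mix = multiplicative_mixture(p_d, p_a, r=0.5)
    draft, n_accept = accepted_prefix_length(p_d, p_mix, rng)
    print("drafted tokens:", draft, "accepted in parallel:", n_accept)
```

Under these assumptions, the prefix length adapts on its own: where the two models agree, many tokens are kept per step, and where the small autoregressive model disagrees with the dLLM marginals, the step keeps fewer tokens.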
