Accelerating Diffusion Large Language Models: An Approach via Adaptive Parallel Decoding

Abstract

The generation speed of LLMs is bottlenecked by autoregressive decoding, in which tokens are predicted sequentially, one at a time. Diffusion large language models (dLLMs), by contrast, theoretically allow parallel token generation, but in practice they struggle to match the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method exposes three tunable parameters for flexibly trading off throughput against quality. We show that APD provides markedly higher throughput with minimal quality degradation on downstream benchmarks.
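The multiplicative mixture mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names, the mixture weight `r`, the per-position normalization, and the greedy agreement rule used to decide how many drafted tokens to keep are all hypothetical, since the abstract does not specify them.

```python
import math


def multiplicative_mixture(p_dllm, p_ar, r):
    """Return the normalized product p_dllm[v]**r * p_ar[v]**(1 - r).

    `r` is a hypothetical mixture weight in [0, 1] (an assumption of
    this sketch, not a parameter named in the abstract).
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")

    def weighted_log(p, w):
        # Treat 0**0 as 1 so that r = 0 or r = 1 recovers a single model.
        if w == 0.0:
            return 0.0
        return w * math.log(p) if p > 0.0 else -math.inf

    scores = [weighted_log(a, r) + weighted_log(b, 1.0 - r)
              for a, b in zip(p_dllm, p_ar)]
    top = max(scores)
    if top == -math.inf:
        raise ValueError("the two distributions share no support")
    # Subtract the maximum before exponentiating for numerical stability.
    unnorm = [math.exp(s - top) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def count_accepted(dllm_marginals, ar_conditionals, r):
    """Count leading positions where the greedy dLLM draft matches the mixture.

    Assumed simplification: the draft token is the argmax of the dLLM
    marginal, and it is kept only while it is also the argmax of the
    mixture. Scanning stops at the first disagreement, so the number of
    tokens kept per step adapts to how well the two models agree.
    """
    accepted = 0
    for p_d, p_a in zip(dllm_marginals, ar_conditionals):
        target = multiplicative_mixture(p_d, p_a, r)
        if argmax(p_d) != argmax(target):
            break
        accepted += 1
    return accepted


# Toy three-token vocabulary, three masked positions.
dllm_marginals = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]]
ar_conditionals = [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]

kept_balanced = count_accepted(dllm_marginals, ar_conditionals, r=0.5)
kept_dllm_only = count_accepted(dllm_marginals, ar_conditionals, r=1.0)
```

In this toy setting, a weight of 1 relies on the dLLM alone and keeps every drafted position, while a smaller weight lets the auxiliary model veto the second position, so fewer tokens are kept in that step. This is one way to read the throughput and quality trade-off the abstract describes, under the assumptions stated above.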

Accelerating Diffusion LLMs via Adaptive Parallel Decoding
