Set Block Decoding is a Language Model Inference Accelerator
September 4, 2025
Authors: Itai Gat, Heli Ben-Hamu, Marton Havasi, Daniel Haziza, Jeremy Reizenstein, Gabriel Synnaeve, David Lopez-Paz, Brian Karrer, Yaron Lipman
cs.AI
Abstract
Autoregressive next token prediction language models offer powerful
capabilities but face significant challenges in practical deployment due to the
high computational and memory costs of inference, particularly during the
decoding stage. We introduce Set Block Decoding (SBD), a simple and flexible
paradigm that accelerates generation by integrating standard next token
prediction (NTP) and masked token prediction (MATP) within a single
architecture. SBD allows the model to sample multiple, not necessarily
consecutive, future tokens in parallel, a key distinction from previous
acceleration methods. This flexibility allows the use of advanced solvers from
the discrete diffusion literature, offering significant speedups without
sacrificing accuracy. SBD requires no architectural changes or extra training
hyperparameters, maintains compatibility with exact KV-caching, and can be
implemented by fine-tuning existing next token prediction models. By
fine-tuning Llama-3.1 8B and Qwen-3 8B, we demonstrate that SBD enables a 3-5x
reduction in the number of forward passes required for generation while
achieving the same performance as equivalent NTP training.
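Since the abstract describes SBD only at a high level, the following Python sketch illustrates the core idea: a block of masked future positions is filled over a few parallel forward passes, and the positions committed in each pass need not be consecutive. The model interface (`logits_fn`), the mask id, and the confidence-based top-k unmasking rule are hypothetical stand-ins for exposition, not the paper's actual architecture or its discrete-diffusion solvers.

```python
# Minimal, illustrative sketch of a set-block-decoding-style loop.
# `logits_fn` stands in for a transformer fine-tuned to also predict
# masked positions; everything here is a simplified assumption.
import torch

MASK_ID = 0  # hypothetical mask token id


def set_block_decode(logits_fn, prompt, block_size=8, steps=4):
    """Fill `block_size` masked tokens in roughly `steps` parallel passes.

    In each pass, the most confident masked positions anywhere in the
    block are committed, so unmasked tokens need not be consecutive.
    """
    seq = torch.cat([prompt, torch.full((block_size,), MASK_ID)])
    masked = torch.arange(len(prompt), len(seq))  # indices still masked
    per_step = max(1, block_size // steps)
    while masked.numel() > 0:
        logits = logits_fn(seq)                   # one forward pass
        probs = logits[masked].softmax(-1)
        conf, tokens = probs.max(-1)              # greedy per position
        k = min(per_step, masked.numel())
        pick = conf.topk(k).indices               # most confident subset
        seq[masked[pick]] = tokens[pick]          # commit in parallel
        keep = torch.ones_like(masked, dtype=torch.bool)
        keep[pick] = False
        masked = masked[keep]
    return seq


# Toy usage with a random "model" over a vocabulary of 100 tokens,
# just to show that the loop runs end to end.
vocab = 100
dummy = lambda s: torch.randn(len(s), vocab)
print(set_block_decode(dummy, prompt=torch.tensor([5, 17, 42])))
```

With `block_size=8` and `steps=4`, the block is generated in about four forward passes instead of eight, which is the kind of reduction in decoding passes the abstract reports (3-5x in the paper's fine-tuned Llama-3.1 8B and Qwen-3 8B experiments).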