Set Block Decoding is a Language Model Inference Accelerator
September 4, 2025
Authors: Itai Gat, Heli Ben-Hamu, Marton Havasi, Daniel Haziza, Jeremy Reizenstein, Gabriel Synnaeve, David Lopez-Paz, Brian Karrer, Yaron Lipman
cs.AI
Abstract
Autoregressive next token prediction language models offer powerful
capabilities but face significant challenges in practical deployment due to the
high computational and memory costs of inference, particularly during the
decoding stage. We introduce Set Block Decoding (SBD), a simple and flexible
paradigm that accelerates generation by integrating standard next token
prediction (NTP) and masked token prediction (MATP) within a single
architecture. SBD allows the model to sample multiple, not necessarily
consecutive, future tokens in parallel, a key distinction from previous
acceleration methods. This flexibility allows the use of advanced solvers from
the discrete diffusion literature, offering significant speedups without
sacrificing accuracy. SBD requires no architectural changes or extra training
hyperparameters, maintains compatibility with exact KV-caching, and can be
implemented by fine-tuning existing next token prediction models. By
fine-tuning Llama-3.1 8B and Qwen-3 8B, we demonstrate that SBD enables a 3-5x
reduction in the number of forward passes required for generation while
achieving the same performance as equivalent NTP training.
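
To make the decoding loop described above concrete, here is a minimal, hypothetical sketch, not the paper's implementation: `model`, `MASK_ID`, `BLOCK`, `PASSES`, and the confidence-based selection rule are all illustrative assumptions standing in for the paper's architecture and the discrete-diffusion solvers it references.

```python
# Hypothetical sketch (not the paper's code): decode one block of future
# positions by unmasking a *set* of not-necessarily-consecutive positions per
# forward pass, instead of one token at a time as in next-token prediction.
import math
import torch

MASK_ID = 0   # assumed id of the mask token
BLOCK = 16    # future positions decoded per block
PASSES = 4    # forward passes per block, i.e. a 4x reduction vs. NTP


def decode_block(model, prefix: torch.Tensor) -> torch.Tensor:
    """Fill BLOCK masked positions after `prefix` in PASSES forward passes.

    `model` is assumed to map a (1, seq_len) tensor of token ids to
    (1, seq_len, vocab) logits, scoring masked positions as well.
    """
    block = torch.full((BLOCK,), MASK_ID, dtype=torch.long)
    still_masked = list(range(BLOCK))
    for step in range(PASSES):
        if not still_masked:
            break
        # One forward pass over the prefix plus the partially filled block.
        logits = model(torch.cat([prefix, block]).unsqueeze(0))[0, -BLOCK:]
        conf, preds = logits.softmax(dim=-1).max(dim=-1)
        # Commit the most confident still-masked positions this pass; this
        # greedy rule is only a stand-in for the advanced solvers from the
        # discrete diffusion literature that the abstract mentions.
        k = math.ceil(len(still_masked) / (PASSES - step))
        chosen = sorted(still_masked, key=lambda i: -conf[i].item())[:k]
        for i in chosen:
            block[i] = preds[i]
            still_masked.remove(i)
    return block
```

Under these assumed settings, each call produces 16 tokens in 4 forward passes; the actual block size, position-selection rule, and solver in SBD may differ.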