Set Block Decoding is a Language Model Inference Accelerator

September 4, 2025
Authors: Itai Gat, Heli Ben-Hamu, Marton Havasi, Daniel Haziza, Jeremy Reizenstein, Gabriel Synnaeve, David Lopez-Paz, Brian Karrer, Yaron Lipman
cs.AI

Abstract

Autoregressive next token prediction language models offer powerful capabilities but face significant challenges in practical deployment due to the high computational and memory costs of inference, particularly during the decoding stage. We introduce Set Block Decoding (SBD), a simple and flexible paradigm that accelerates generation by integrating standard next token prediction (NTP) and masked token prediction (MATP) within a single architecture. SBD allows the model to sample multiple, not necessarily consecutive, future tokens in parallel, a key distinction from previous acceleration methods. This flexibility allows the use of advanced solvers from the discrete diffusion literature, offering significant speedups without sacrificing accuracy. SBD requires no architectural changes or extra training hyperparameters, maintains compatibility with exact KV-caching, and can be implemented by fine-tuning existing next token prediction models. By fine-tuning Llama-3.1 8B and Qwen-3 8B, we demonstrate that SBD enables a 3-5x reduction in the number of forward passes required for generation while achieving the same performance as equivalent NTP training.
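To make the decoding scheme concrete, the sketch below illustrates one way an SBD-style block step could look. It is not the authors' implementation: the `MASK_ID` constant, the `model` call signature (a callable returning per-position logits for a 1-D tensor of token ids), and the confidence-based unmasking schedule are all assumptions standing in for the paper's actual solvers from the discrete diffusion literature.

```python
import torch

MASK_ID = 0  # hypothetical id for a reserved mask token

def sbd_generate_block(model, prefix_ids, block_size=16, steps=4):
    """Sketch of one Set Block Decoding step: append a block of masked
    positions, then fill the most confident ones in parallel over a few
    solver steps. Committed positions need not be consecutive."""
    filler = torch.full((block_size,), MASK_ID, dtype=torch.long)
    ids = torch.cat([prefix_ids, filler])
    masked = torch.arange(len(prefix_ids), len(ids))  # positions still masked
    for step in range(steps):
        logits = model(ids)                  # one forward pass per solver step
        probs = logits[masked].softmax(dim=-1)
        conf, pred = probs.max(dim=-1)       # greedy prediction + its confidence
        # Commit the most confident remaining positions this step; this
        # schedule unmasks roughly 1/steps of the block each iteration.
        k = max(1, len(masked) // (steps - step))
        take = conf.topk(k).indices
        ids[masked[take]] = pred[take]
        keep = torch.ones(len(masked), dtype=torch.bool)
        keep[take] = False
        masked = masked[keep]
        if len(masked) == 0:
            break
    return ids
```

With `steps` well below `block_size`, each forward pass commits several tokens at once, which is the mechanism behind the reported 3-5x reduction in forward passes; the paper's actual solver choice, masking schedule, and KV-cache handling may differ.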