S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation
March 26, 2026
Authors: Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava
cs.AI
Abstract
Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to a 4.7× speedup over autoregressive decoding, and up to 1.57× over a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is 4.4× faster than the static baseline with slightly higher accuracy.
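The draft-then-verify loop described in the abstract can be sketched in a few lines. This is a toy illustration of the general self-speculative pattern (diffusion mode proposes a block of tokens in parallel; the same model with block size one verifies them autoregressively), not the paper's implementation: the function names `propose_block` and `ar_next_token`, and the fallback-on-mismatch policy, are assumptions for illustration only, and the routing policy that decides *when* to verify is omitted.

```python
def speculative_block_decode(propose_block, ar_next_token, context,
                             block_size=4, max_blocks=8):
    """Toy self-speculative decoding loop (illustrative, not the paper's API).

    propose_block(tokens, k) -> list of k draft tokens (stand-in for the
        diffusion mode's parallel proposal).
    ar_next_token(tokens) -> single next token (stand-in for the same model
        run with block size one, i.e. autoregressive mode as verifier).
    """
    out = list(context)
    for _ in range(max_blocks):
        draft = propose_block(out, block_size)  # parallel draft of one block
        for tok in draft:
            if ar_next_token(out) == tok:
                out.append(tok)                 # verifier agrees: accept draft token
            else:
                out.append(ar_next_token(out))  # mismatch: take the AR token instead
                break                           # and discard the rest of the draft
    return out
```

In real speculative decoding the accept/reject test is probabilistic and both passes are batched; the exact-match test above only conveys the control flow.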