Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
December 16, 2025
Authors: Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, Hao Zhang
cs.AI
Abstract
Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, these methods achieve only limited speedup over AR models because of a pretrain-to-posttrain mismatch: the masked data distribution used in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders exact KV-cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm in which models are trained on their own generated parallel decoding trajectories, smoothly turning AR models into efficient parallel decoders while preserving their pretrained causal inference property. Models trained under this paradigm, Jacobi Forcing Models, achieve a 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Building on the trajectory characteristics of Jacobi Forcing Models, we further introduce multi-block decoding with rejection recycling, which raises the number of accepted tokens per iteration by up to 4.5x and yields nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.
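To make the parallel-decoding setting concrete, the sketch below shows a generic Jacobi-style fixed-point decoding loop for a causal AR model: a whole block of tokens is guessed, refined in parallel each iteration, and accepted once it stops changing (which recovers the greedy AR output). This is a minimal illustration of the general technique, not the paper's Jacobi Forcing training procedure or its multi-block decoding with rejection recycling; the model interface (a Hugging Face-style causal LM), block size, initialization, and stopping rule are assumptions for the example.

```python
import torch

@torch.no_grad()
def jacobi_decode_block(model, prompt_ids, block_size=16, max_iters=None):
    """Decode one block of `block_size` tokens by Jacobi (fixed-point) iteration.

    Each iteration feeds the prompt plus the current guess for the whole block,
    then replaces every guessed token with the model's greedy prediction at the
    preceding position. The loop stops when the guesses no longer change, which
    is guaranteed within `block_size` iterations and matches greedy AR decoding.
    """
    prompt_len = prompt_ids.shape[1]
    # Initialize the block guess arbitrarily (here: repeat the last prompt token).
    guess = prompt_ids[:, -1:].repeat(1, block_size)
    max_iters = max_iters or block_size

    for _ in range(max_iters):
        inputs = torch.cat([prompt_ids, guess], dim=1)
        logits = model(inputs).logits  # shape: [batch, prompt_len + block_size, vocab]
        # Greedy prediction for block position i comes from logits at position
        # (prompt_len + i - 1), so slice off the last position and take argmax.
        preds = logits[:, prompt_len - 1 : -1, :].argmax(dim=-1)
        if torch.equal(preds, guess):  # fixed point reached: block equals the AR output
            break
        guess = preds
    return guess
```

In a full generation loop, accepted blocks would be appended to the prompt and the iteration repeated for the next block; speedup comes from each iteration verifying many tokens at once instead of one token per forward pass.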