ChatPaper.ai

Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

December 16, 2025
Authors: Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, Hao Zhang
cs.AI

Abstract

Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup over AR models due to a pretrain-to-posttrain mismatch: the masked data distribution used in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm in which models are trained on their own generated parallel decoding trajectories, smoothly turning AR models into efficient parallel decoders while preserving their pretrained causal inference property. Models trained under this paradigm, Jacobi Forcing Models, achieve a 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Building on the trajectory characteristics of Jacobi Forcing Models, we further introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.
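
To make the Jacobi-decoding idea behind this paradigm concrete, the sketch below shows a minimal fixed-point refinement loop over a block of draft tokens in plain Python. It is purely illustrative and not taken from the paper or the released repository: the names `jacobi_decode_block` and `next_token` are hypothetical, the padding initialization and convergence check are simplified, and a real implementation would compute each parallel sweep in a single batched causal forward pass with KV cache reuse rather than per-position calls.

```python
# Illustrative sketch of Jacobi-style parallel decoding (not the authors' code).
# Assumption: `next_token` stands in for an AR model's greedy next-token function.
from typing import Callable, List

def jacobi_decode_block(
    next_token: Callable[[List[int]], int],
    prefix: List[int],
    block_size: int,
    pad_id: int = 0,
    max_iters: int = 64,
) -> List[int]:
    """Refine a block of draft tokens until it reaches the AR fixed point.

    Each iteration re-predicts every block position in parallel, conditioning
    position i on the prefix plus the previous iteration's draft tokens at
    positions < i. At the fixed point, the block equals what sequential AR
    decoding would have produced.
    """
    draft = [pad_id] * block_size  # initial guess for the block
    for _ in range(max_iters):
        # One "parallel" sweep: re-predict all block positions from the current draft.
        new_draft = [next_token(prefix + draft[:i]) for i in range(block_size)]
        if new_draft == draft:  # fixed point reached -> matches AR output
            break
        draft = new_draft
    return draft

# Toy usage: a "model" that always predicts last token + 1.
toy_next = lambda seq: (seq[-1] + 1) if seq else 1
print(jacobi_decode_block(toy_next, prefix=[1, 2, 3], block_size=4))
# -> [4, 5, 6, 7], identical to sequential decoding, obtained via parallel sweeps.
```

Because the fixed point reproduces the AR output exactly, the speedup depends on how many tokens are accepted per sweep; training on the model's own parallel decoding trajectories, as the abstract describes, is what pushes convergence toward fewer iterations.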