Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

Abstract

Multi-token generation has emerged as a promising paradigm for accelerating inference in large transformer-based models. Recent efforts primarily explore diffusion large language models (dLLMs) for parallel decoding to reduce inference latency. To achieve generation quality on par with autoregressive (AR) models, many techniques adapt AR models into dLLMs to enable parallel decoding. However, these approaches deliver only limited speedup over AR models because of a pretrain-to-posttrain mismatch. Specifically, the masked data distribution used in post-training deviates significantly from the real-world data distribution seen during pretraining. In addition, dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm in which models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. The model trained under this paradigm, the Jacobi Forcing Model, achieves a 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on the trajectory characteristics of Jacobi Forcing Models, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.
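The abstract does not spell out the decoding procedure, so the following is only a minimal sketch of generic Jacobi (fixed-point) parallel decoding, the idea the method's name points to. It is not the authors' implementation. The function `toy_next_token` is a hypothetical deterministic stand-in for greedy next-token prediction by a causal model, and `ar_decode`, `jacobi_decode`, and `init_token` are illustrative names. The sketch shows that refining a whole block of guessed tokens in parallel converges to the same output as greedy AR decoding, and that the sequence of intermediate blocks forms a decoding trajectory.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def toy_next_token(prefix: Sequence[int]) -> int:
    """Hypothetical stand-in for a causal LM's greedy next-token choice."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def ar_decode(next_token: NextToken, prompt: Sequence[int], n: int) -> List[int]:
    """Standard autoregressive decoding: one token per step."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(
    next_token: NextToken, prompt: Sequence[int], n: int, init_token: int = 0
) -> Tuple[List[int], List[List[int]]]:
    """Jacobi decoding: refine an n-token block until it stops changing.

    Returns the converged block and the trajectory of intermediate blocks.
    """
    block = [init_token] * n           # arbitrary initial guess
    trajectory = [list(block)]
    while True:
        # With a real causal model, all n positions would be computed in a
        # single forward pass; here each position is evaluated separately.
        new_block = [next_token(list(prompt) + block[:i]) for i in range(n)]
        trajectory.append(new_block)
        if new_block == block:         # fixed point reached
            return block, trajectory
        block = new_block


prompt = [3, 1, 4]
tokens, trajectory = jacobi_decode(toy_next_token, prompt, n=5)
print(tokens == ar_decode(toy_next_token, prompt, n=5))
print(len(trajectory) - 1, "parallel iterations")
```

Because position i depends only on the prompt and the first i tokens of the block, each iteration fixes at least one more leading token, so the loop ends after at most n + 1 iterations. In this toy setting the worst case is no faster than AR decoding. Speedup comes only when several tokens become correct in the same iteration, which is presumably what training on self-generated trajectories aims to encourage, although the abstract does not give those details.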
