Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

Abstract

Multi-token generation has emerged as a promising paradigm for accelerating inference in transformer-based large models. Recent efforts primarily explore diffusion large language models (dLLMs), which decode in parallel to reduce inference latency. To reach the generation quality of autoregressive (AR) models, many techniques adapt AR models into dLLMs. However, these approaches achieve only limited speedup over AR models because of a pretrain-to-posttrain mismatch. Specifically, the masked data distribution used in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm in which models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. Models trained under this paradigm, called Jacobi Forcing Models, achieve a 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on the trajectory characteristics of Jacobi Forcing Models, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.
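For readers unfamiliar with the decoding scheme the name refers to, the following is a minimal sketch of plain Jacobi (fixed-point) decoding with a causal model. It is not the Jacobi Forcing training procedure, nor the multi-block decoding with rejection recycling described in the abstract. The toy model `toy_next_token` and the helper names `ar_decode` and `jacobi_decode` are hypothetical and exist only for illustration. The sketch shows that refining a whole block of guessed tokens at once reaches the same output as greedy token-by-token decoding, in no more passes than there are tokens in the block.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for a greedy causal LM (assumption, not a real model).

    Some positions depend only on the context length, others on the last token,
    so a wrong guess does not always propagate to later positions.
    """
    n = len(context)
    if n % 3 == 0:
        return n % 5
    return (context[-1] + 1) % 5


def ar_decode(next_token: NextToken, prompt: List[int], n: int) -> List[int]:
    """Standard greedy autoregressive decoding: one token per model call."""
    out: List[int] = []
    for _ in range(n):
        out.append(next_token(prompt + out))
    return out


def jacobi_decode(
    next_token: NextToken, prompt: List[int], n: int, init: int = 0
) -> Tuple[List[int], int]:
    """Jacobi decoding of an n-token block; returns (tokens, number of passes).

    Each pass recomputes every position from the previous pass's guesses.
    On real hardware all positions of one pass would be evaluated in a single
    batched forward call under a causal mask; here they are computed in a
    list comprehension for clarity.
    """
    block = [init] * n
    passes = 0
    while passes < n:
        new_block = [next_token(prompt + block[:i]) for i in range(n)]
        passes += 1
        if new_block == block:  # fixed point reached: block equals greedy output
            break
        block = new_block
    return block, passes


if __name__ == "__main__":
    prompt = [1, 2]
    reference = ar_decode(toy_next_token, prompt, 8)
    tokens, passes = jacobi_decode(toy_next_token, prompt, 8)
    print("AR output:    ", reference)
    print("Jacobi output:", tokens, "in", passes, "passes")
```

Position i is guaranteed correct after at most i + 1 passes, because its causal context is then fully correct, so the loop never needs more than n passes. Any speedup comes from positions that settle earlier than that bound; the abstract's approach trains the model on its own decoding trajectories to improve on this.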
