Diffusion Language Models Know the Answer Before Decoding

Abstract

Diffusion language models (DLMs) have recently emerged as an alternative to autoregressive approaches, offering parallel sequence generation and flexible token orders. However, their inference remains slower than that of autoregressive models, primarily due to the cost of bidirectional attention and the large number of refinement steps required for high-quality outputs. In this work, we highlight and leverage an overlooked property of DLMs, early answer convergence: in many cases, the correct answer can be internally identified halfway through refinement, before the final decoding step, under both semi-autoregressive and random remasking schedules. For example, on GSM8K and MMLU, up to 97% and 99% of instances, respectively, can be decoded correctly using only half of the refinement steps. Building on this observation, we introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding. Specifically, Prophet dynamically decides whether to continue refinement or to go "all-in" (i.e., decode all remaining tokens in one step), using the confidence gap between the top-2 prediction candidates as the criterion. It integrates seamlessly into existing DLM implementations, incurs negligible overhead, and requires no additional training. Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality. These results recast DLM decoding as a problem of when to stop sampling, and demonstrate that early decode convergence provides a simple yet powerful mechanism for accelerating DLM inference, complementary to existing speedup techniques. Our code is publicly available at https://github.com/pixeli99/Prophet.
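
The early-commit rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (their repository is the reference). The abstract does not say how the top-2 gap is aggregated across positions, what threshold is used, or how tokens are unmasked between steps, so the mean-gap rule, the `threshold` parameter, the one-token-per-step schedule, and the `predict` interface below are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Distribution = Sequence[float]


def top2_gap(probs: Distribution) -> float:
    """Return the gap between the two largest probabilities at one position."""
    if len(probs) < 2:
        raise ValueError("need at least two candidates")
    first, second = sorted(probs, reverse=True)[:2]
    return first - second


def should_commit(masked_dists: Sequence[Distribution], threshold: float) -> bool:
    """Assumed rule: commit once the mean top-2 gap reaches the threshold."""
    if not masked_dists:
        return True  # nothing left to decode
    gaps = [top2_gap(d) for d in masked_dists]
    return sum(gaps) / len(gaps) >= threshold


def argmax(probs: Distribution) -> int:
    """Index of the largest probability (first one on ties)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def early_commit_decode(
    predict: Callable[[List[Optional[int]]], List[Distribution]],
    length: int,
    max_steps: int,
    threshold: float,
) -> Tuple[List[Optional[int]], int]:
    """Decode `length` tokens, going "all-in" when the gap rule fires.

    `predict` is a hypothetical model interface: it takes the current tokens
    (None marks a masked position) and returns one distribution per position.
    Returns the decoded tokens and the number of model calls used.
    """
    tokens: List[Optional[int]] = [None] * length
    steps_used = 0
    for step in range(1, max_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t is None]
        if not masked:
            break
        dists = predict(tokens)
        steps_used = step
        if step == max_steps or should_commit([dists[i] for i in masked], threshold):
            # All-in: fill every remaining masked position in this one step.
            for i in masked:
                tokens[i] = argmax(dists[i])
            break
        # Otherwise keep refining: unmask only the most confident position
        # (an assumed schedule, chosen to keep the sketch short).
        best = max(masked, key=lambda i: max(dists[i]))
        tokens[best] = argmax(dists[best])
    return tokens, steps_used
```

With a confident toy predictor, a low threshold commits every token after a single model call, while a threshold close to 1 falls back to unmasking one token per step. That trade-off between step count and caution is the one the abstract describes.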
