S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Abstract

Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to 4.7× speedup over autoregressive decoding, and up to 1.57× over a tuned dynamic decoding baseline, while improving accuracy by up to 4.5 points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is 4.4× faster than the static baseline with slightly higher accuracy.
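
The draft-then-verify idea described above can be illustrated with a minimal sketch. This is not the S2D2 algorithm itself: it assumes a simple greedy acceptance rule (keep the longest draft prefix that the autoregressive mode agrees with), and the names `verify_draft` and `toy_ar_next_token` are hypothetical stand-ins. The abstract does not specify the acceptance rule or the routing policies, so neither is modeled here.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    ar_next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy speculative verification (illustrative assumption, not the paper's code).

    Accepts the longest prefix of `draft` that the autoregressive verifier
    would also have produced, then substitutes the verifier's own token at
    the first disagreement and stops.
    """
    accepted: List[int] = []
    context = list(prefix)
    for token in draft:
        expected = ar_next_token(context)
        if token != expected:
            # First mismatch: keep the verifier's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)
    return accepted


def toy_ar_next_token(context: Sequence[int]) -> int:
    """Toy stand-in for the block-size-1 (autoregressive) mode: last token + 1."""
    return context[-1] + 1


# The diffusion pass is imagined to have proposed four tokens in parallel.
result = verify_draft(prefix=[0], draft=[1, 2, 9, 4], ar_next_token=toy_ar_next_token)
print(result)  # [1, 2, 3]
```

The loop calls the verifier once per token only for readability. In a speculative decoding setup the verifier would typically score all draft positions in a single forward pass, which is where the speed benefit comes from.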
