Accelerating Speculative Decoding with Block Diffusion Draft Trees

Abstract

Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially limiting its acceptance length. We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model's output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, these gains place DDTree among the leading approaches to speculative decoding.
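
The tree construction and the verification mask described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the surrogate score of a path is the product of the drafter's per-position probabilities (the abstract does not define the surrogate), and the names build_draft_tree, ancestor_mask, position_probs, and node_budget are hypothetical.

```python
import heapq
import math


def build_draft_tree(position_probs, node_budget):
    """Best-first selection of draft-tree nodes under a node budget.

    position_probs[d] is a list of (token, prob) pairs for draft position d.
    ASSUMPTION: a path's score is the sum of per-position log-probabilities.
    Returns a list of nodes; parent == -1 means "attached to the verified prefix".
    """
    # Rank candidates at each position by descending probability, dropping zeros.
    ranked = [
        sorted((tp for tp in probs if tp[1] > 0), key=lambda tp: -tp[1])
        for probs in position_probs
    ]
    nodes = []
    if node_budget <= 0 or not ranked or not ranked[0]:
        return nodes

    counter = 0  # tie-breaker so heap entries never compare beyond this field
    # Heap entries: (-path_logp, counter, parent_index, depth, rank)
    heap = [(-math.log(ranked[0][0][1]), counter, -1, 0, 0)]

    while heap and len(nodes) < node_budget:
        neg_logp, _, parent, depth, rank = heapq.heappop(heap)
        token, _ = ranked[depth][rank]
        index = len(nodes)
        logp = -neg_logp
        nodes.append({"token": token, "parent": parent, "depth": depth, "logp": logp})

        # Candidate 1: the best token at the next position, as a child of this node.
        if depth + 1 < len(ranked) and ranked[depth + 1]:
            counter += 1
            child_logp = logp + math.log(ranked[depth + 1][0][1])
            heapq.heappush(heap, (-child_logp, counter, index, depth + 1, 0))

        # Candidate 2: the next-ranked token at this position, under the same parent.
        if rank + 1 < len(ranked[depth]):
            counter += 1
            parent_logp = nodes[parent]["logp"] if parent >= 0 else 0.0
            sibling_logp = parent_logp + math.log(ranked[depth][rank + 1][1])
            heapq.heappush(heap, (-sibling_logp, counter, parent, depth, rank + 1))

    return nodes


def ancestor_mask(nodes):
    """mask[i][j] is True when node i may attend to node j (itself or an ancestor).

    Attention to the already-verified prefix is omitted here for brevity.
    """
    n = len(nodes)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = nodes[j]["parent"]
    return mask
```

Because every child and every next-ranked sibling scores no higher than the node that generated it, popping from the heap visits candidates in non-increasing score order, and a node is always added after its parent. In this sketch the first node_budget pops are therefore the highest-scoring paths under the assumed surrogate.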
