dMoE: dLLMs with Learnable Block Experts

Abstract

Diffusion Large Language Models (dLLMs) have recently emerged as a promising alternative to autoregressive models, offering competitive performance while naturally supporting parallel decoding. However, as dLLMs are increasingly integrated with Mixture-of-Experts (MoE) architectures to scale model capacity, a fundamental mismatch arises between block parallel decoding and token-level expert selection. Specifically, each dLLM forward pass processes multiple tokens with bidirectional dependencies, whereas conventional MoE layers route each token independently. This mismatch substantially increases the number of uniquely activated experts, making inference increasingly memory-bound. To address this, we propose dMoE, a simple yet effective block-level MoE framework. The central idea of dMoE is to aggregate token-level expert distributions within each block into a unified block-level expert distribution, which is then used to guide expert routing in a more coherent manner. In this way, dMoE substantially reduces the number of uniquely activated experts during inference without sacrificing performance, thereby mitigating the memory-bound bottleneck. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of dMoE. On average, dMoE reduces the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% of the original performance. Meanwhile, it reduces memory usage by 76.64% to 79.84% and achieves 1.14× to 1.66× end-to-end latency speedup. Code is available at: https://github.com/fscdc/dMoE
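
To make the block-level idea concrete, the following is a minimal sketch and not the released implementation. The abstract does not specify how token-level distributions are aggregated or how the block-level distribution guides routing, so two assumptions are made here: aggregation is a plain mean over the tokens in a block, and guidance means restricting every token to a block-level shortlist of experts. The function names, the `shortlist_size` parameter, and the sizes in the demo are all hypothetical, and the learnable component suggested by the title is not modeled.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k, allowed=None):
    """Indices of the k largest scores, optionally restricted to `allowed`."""
    candidates = range(len(scores)) if allowed is None else allowed
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:k]


def token_level_routing(block_logits, k):
    """Baseline: every token independently picks its own top-k experts."""
    return [top_k(softmax(logits), k) for logits in block_logits]


def block_level_routing(block_logits, k, shortlist_size):
    """Assumed scheme (not confirmed by the source): average the token-level
    distributions into one block-level distribution, keep its top experts as
    a shortlist, then let each token pick its top-k from that shortlist."""
    probs = [softmax(logits) for logits in block_logits]
    num_experts = len(probs[0])
    block_dist = [
        sum(p[e] for p in probs) / len(probs) for e in range(num_experts)
    ]
    shortlist = top_k(block_dist, shortlist_size)
    return [top_k(p, k, allowed=shortlist) for p in probs]


def unique_experts(routes):
    """Number of distinct experts activated across all tokens in the block."""
    return len({e for route in routes for e in route})


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative sizes only; they are not taken from the paper.
    num_tokens, num_experts, k, shortlist_size = 32, 64, 8, 16
    block_logits = [
        [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
        for _ in range(num_tokens)
    ]
    baseline = token_level_routing(block_logits, k)
    blocked = block_level_routing(block_logits, k, shortlist_size)
    print("token-level unique experts:", unique_experts(baseline))
    print("block-level unique experts:", unique_experts(blocked))
```

Under this assumed scheme the number of uniquely activated experts per block can never exceed `shortlist_size`, which is the property that matters for memory-bound inference: only the shortlisted experts' weights need to be loaded for the forward pass over that block.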