dMoE: dLLMs with Learnable Block Experts
May 29, 2026
Authors: Sicheng Feng, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang
cs.AI
Abstract
Diffusion Large Language Models (dLLMs) have recently emerged as a promising alternative to autoregressive models, offering competitive performance while naturally supporting parallel decoding. However, as dLLMs are increasingly integrated with Mixture-of-Experts (MoE) architectures to scale model capacity, a fundamental mismatch arises between block-parallel decoding and token-level expert selection. Specifically, each dLLM forward pass processes multiple tokens with bidirectional dependencies, whereas conventional MoE layers route each token independently. This mismatch substantially increases the number of uniquely activated experts, making inference increasingly memory-bound. To address this, we propose dMoE, a simple yet effective block-level MoE framework. The central idea of dMoE is to aggregate the token-level expert distributions within each block into a unified block-level expert distribution, which is then used to guide expert routing in a more coherent manner. In this way, dMoE substantially reduces the number of uniquely activated experts during inference without sacrificing performance, thereby mitigating the memory-bound bottleneck. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of dMoE. On average, dMoE reduces the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% of the original performance. It also reduces memory usage by 76.64% to 79.84% and achieves a 1.14× to 1.66× end-to-end latency speedup. Code is available at: https://github.com/fscdc/dMoE
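The contrast between token-level and block-level routing can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the block-level distribution is computed or learned, so the plain mean over tokens, the `block_budget` parameter, and every function name below are assumptions made purely for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k, allowed=None):
    """Indices of the k highest scores, optionally restricted to `allowed`."""
    candidates = range(len(scores)) if allowed is None else allowed
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]


def token_level_route(block_logits, k):
    """Conventional MoE routing: every token picks its own top-k experts."""
    return [top_k(softmax(row), k) for row in block_logits]


def block_level_route(block_logits, k, block_budget):
    """Hypothetical block-level routing (an assumption, not the paper's method).

    Token-level expert distributions are averaged into one block-level
    distribution; each token then picks its top-k experts from the
    `block_budget` experts ranked highest by that block-level distribution.
    """
    probs = [softmax(row) for row in block_logits]
    num_experts = len(probs[0])
    block_dist = [sum(p[e] for p in probs) / len(probs) for e in range(num_experts)]
    shared = top_k(block_dist, block_budget)
    return [top_k(p, k, allowed=shared) for p in probs]


def unique_experts(routes):
    """Number of distinct experts activated across the whole block."""
    return len({e for route in routes for e in route})


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy block: 32 tokens, 64 experts, random router logits (synthetic data).
    logits = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(32)]
    token_routes = token_level_route(logits, k=8)
    block_routes = block_level_route(logits, k=8, block_budget=16)
    print("token-level unique experts:", unique_experts(token_routes))
    print("block-level unique experts:", unique_experts(block_routes))
```

In this sketch the block-level variant can never activate more than `block_budget` distinct experts per block, which is the property that reduces the number of expert weights read from memory per forward pass. The toy logits are random, so the printed counts are not comparable to the figures reported in the abstract.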