dMoE: dLLMs with Learnable Block Experts
May 29, 2026
Authors: Sicheng Feng, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang
cs.AI
Abstract
Diffusion Large Language Models (dLLMs) have recently emerged as a promising alternative to autoregressive models, offering competitive performance while naturally supporting parallel decoding. However, as dLLMs are increasingly integrated with Mixture-of-Experts (MoE) architectures to scale model capacity, a fundamental mismatch arises between block-parallel decoding and token-level expert selection. Specifically, each dLLM forward pass processes multiple tokens with bidirectional dependencies, whereas conventional MoE layers route each token independently. This mismatch substantially increases the number of uniquely activated experts, making inference increasingly memory-bound. To address this, we propose dMoE, a simple yet effective block-level MoE framework. The central idea of dMoE is to aggregate token-level expert distributions within each block into a unified block-level expert distribution, which is then used to guide expert routing in a more coherent manner. In this way, dMoE substantially reduces the number of uniquely activated experts during inference without sacrificing performance, thereby mitigating the memory-bound bottleneck. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of dMoE. On average, dMoE reduces the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% of the original performance. Meanwhile, it reduces memory usage by 76.64% to 79.84% and achieves 1.14× to 1.66× end-to-end latency speedup. Code is available at: https://github.com/fscdc/dMoE
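To make the contrast between token-level and block-level routing concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the mean aggregation, the `block_budget` parameter, the per-token re-selection inside the shared set, and all function names are assumptions chosen for illustration, and the abstract does not specify how dMoE aggregates distributions or how its block experts are learned.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(scores, k, allowed=None):
    """Indices of the k largest scores, optionally restricted to `allowed`."""
    candidates = range(len(scores)) if allowed is None else allowed
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:k]

def token_level_routing(block_logits, k):
    """Conventional MoE: every token picks its own top-k experts."""
    return [top_k(softmax(row), k) for row in block_logits]

def block_level_routing(block_logits, k, block_budget):
    """Hypothetical block-level routing (an assumption, not the paper's rule).

    1. Turn each token's logits into an expert distribution.
    2. Average those distributions into one block-level distribution.
    3. Keep the `block_budget` most probable experts for the whole block.
    4. Let each token pick its top-k experts from that shared set only.
    """
    probs = [softmax(row) for row in block_logits]
    num_experts = len(probs[0])
    block_dist = [sum(p[e] for p in probs) / len(probs) for e in range(num_experts)]
    shared = top_k(block_dist, block_budget)
    return [top_k(p, k, allowed=shared) for p in probs]

def unique_experts(assignments):
    """Set of experts activated by at least one token in the block."""
    return {e for token in assignments for e in token}

# Toy block: 4 tokens, 8 experts, made-up router logits.
logits = [
    [2.0, 1.5, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1.8, 0.0, 1.6, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.7, 0.0, 1.9, 0.0, 0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0, 0.0, 1.4, 0.0, 0.0, 0.0],
]
per_token = token_level_routing(logits, k=2)
per_block = block_level_routing(logits, k=2, block_budget=3)
print(len(unique_experts(per_token)), len(unique_experts(per_block)))
```

In this toy setting, independent routing spreads the block across more distinct experts than the shared-set variant, which by construction never activates more than `block_budget` experts per block. Fewer distinct experts per forward pass means fewer expert weights to load, which is the memory-bound effect the abstract describes.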