Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models
April 2, 2026
Authors: Shuibai Zhang, Caspian Zhuang, Chihan Cui, Zhihan Yang, Fred Zhangzhi Peng, Yanxin Zhang, Haoyue Bai, Zack Jia, Yang Zhou, Guanhua Chen, Ming Liu
cs.AI
Abstract
Diffusion language models (DLMs) enable parallel, non-autoregressive text generation, yet existing DLM mixture-of-experts (MoE) models inherit token-choice (TC) routing from autoregressive systems, leading to load imbalance and rigid computation allocation. We show that expert-choice (EC) routing is a better fit for DLMs: it provides deterministic load balancing by design, yielding higher throughput and faster convergence than TC. Building on the property that EC capacity is externally controllable, we introduce timestep-dependent expert capacity, which varies expert allocation according to the denoising step. We find that allocating more capacity to low-mask-ratio steps consistently achieves the best performance under matched FLOPs, and provide a mechanistic explanation: tokens in low-mask-ratio contexts exhibit an order-of-magnitude higher learning efficiency, so concentrating compute on these steps yields the largest marginal return. Finally, we show that existing pretrained TC DLMs can be retrofitted to EC by replacing only the router, achieving faster convergence and improved accuracy across diverse downstream tasks. Together, these results establish EC routing as a superior paradigm for DLM MoE models and demonstrate that computation in DLMs can be treated as an adaptive policy rather than a fixed architectural constant. Code is available at https://github.com/zhangshuibai/EC-DLM.
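To make the routing distinction concrete, the following is a minimal NumPy sketch (not the paper's implementation) of expert-choice routing with a timestep-dependent capacity. The linear capacity schedule in `timestep_capacity` is an illustrative assumption standing in for whatever schedule the paper uses; the point it demonstrates is that each expert processes exactly `capacity` tokens, so load balance holds by construction, and that capacity can be raised at low mask ratios while keeping average FLOPs matched.

```python
import numpy as np

def expert_choice_route(affinity, capacity):
    """Expert-choice routing: each expert selects its top-`capacity`
    tokens by affinity score.

    affinity: [num_tokens, num_experts] router scores.
    Returns a boolean assignment matrix of the same shape. Unlike
    token-choice routing, every expert receives exactly `capacity`
    tokens, so load balancing is deterministic by design.
    """
    num_tokens, num_experts = affinity.shape
    assign = np.zeros_like(affinity, dtype=bool)
    for e in range(num_experts):
        top = np.argsort(affinity[:, e])[-capacity:]  # top-capacity tokens for expert e
        assign[top, e] = True
    return assign

def timestep_capacity(mask_ratio, base_capacity):
    """Hypothetical schedule: allocate more capacity to low-mask-ratio
    denoising steps. Linear in (1 - mask_ratio), scaled so the mean
    over uniformly sampled mask ratios equals `base_capacity`
    (i.e., matched FLOPs on average)."""
    return max(1, int(round(2 * base_capacity * (1.0 - mask_ratio))))

rng = np.random.default_rng(0)
affinity = rng.standard_normal((16, 4))  # 16 tokens, 4 experts
cap = timestep_capacity(mask_ratio=0.25, base_capacity=4)  # low mask ratio -> cap = 6
assign = expert_choice_route(affinity, cap)
print(assign.sum(axis=0))  # per-expert load: [6 6 6 6], balanced by construction
```

Note that token-choice routing would instead take a top-k along axis 1 (each token picks experts), which leaves per-expert load a random variable and motivates the auxiliary balancing losses that EC routing makes unnecessary.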