Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models

Abstract

Diffusion language models (DLMs) enable parallel, non-autoregressive text generation, yet existing DLM mixture-of-experts (MoE) models inherit token-choice (TC) routing from autoregressive systems, leading to load imbalance and rigid computation allocation. We show that expert-choice (EC) routing is a better fit for DLMs: it provides deterministic load balancing by design, yielding higher throughput and faster convergence than TC. Building on the property that EC capacity is externally controllable, we introduce timestep-dependent expert capacity, which varies expert allocation according to the denoising step. We find that allocating more capacity to low-mask-ratio steps consistently achieves the best performance under matched FLOPs, and provide a mechanistic explanation: tokens in low-mask-ratio contexts exhibit an order-of-magnitude higher learning efficiency, so concentrating compute on these steps yields the largest marginal return. Finally, we show that existing pretrained TC DLMs can be retrofitted to EC by replacing only the router, achieving faster convergence and improved accuracy across diverse downstream tasks. Together, these results establish EC routing as a superior paradigm for DLM MoE models and demonstrate that computation in DLMs can be treated as an adaptive policy rather than a fixed architectural constant. Code is available at https://github.com/zhangshuibai/EC-DLM.
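
The two mechanisms named in the abstract can be illustrated with a small, dependency-free sketch. It is not taken from the linked repository. The function names, the linear schedule, and the `min_k`/`max_k` bounds are assumptions chosen for illustration; the abstract does not state which schedule the authors use. The capacity formula (average experts per token, times tokens, divided by experts) follows the usual expert-choice convention and is likewise assumed here.

```python
import random


def expert_choice_route(scores, capacity):
    """Expert-choice routing: every expert selects its top-`capacity` tokens.

    scores[t][e] is the router affinity of token t for expert e.
    Returns a list whose entry e holds the token indices chosen by expert e.
    Because each expert takes exactly the same number of tokens, the load
    is balanced by construction; a token may be picked by zero, one, or
    several experts.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    k = min(capacity, num_tokens)
    assignments = []
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignments.append(sorted(ranked[:k]))
    return assignments


def capacity_for_step(mask_ratio, num_tokens, num_experts, min_k=1.0, max_k=3.0):
    """Hypothetical linear schedule (an assumption, not the paper's).

    Maps the mask ratio of a denoising step to a per-expert capacity so
    that low-mask-ratio steps receive more compute. min_k and max_k are
    the average number of experts per token at mask ratio 1.0 and 0.0.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    avg_experts_per_token = max_k - (max_k - min_k) * mask_ratio
    return max(1, round(avg_experts_per_token * num_tokens / num_experts))


# Toy demonstration with random router scores (illustrative data only).
rng = random.Random(0)
num_tokens, num_experts = 16, 4
scores = [[rng.random() for _ in range(num_experts)] for _ in range(num_tokens)]

for mask_ratio in (0.9, 0.5, 0.1):
    k = capacity_for_step(mask_ratio, num_tokens, num_experts)
    routed = expert_choice_route(scores, k)
    loads = [len(tokens) for tokens in routed]
    print(f"mask_ratio={mask_ratio}: capacity={k}, per-expert loads={loads}")
```

In this sketch the per-expert loads are identical at every step, which is the load-balancing property the abstract attributes to EC, while the capacity itself is set from outside the router. If mask ratios are sampled uniformly, the linear schedule averages to `(min_k + max_k) / 2` experts per token, which is one simple way to compare against a fixed-capacity baseline at similar total compute.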
