TIDE: Efficient and Lossless MoE Diffusion LLM Inference via I/O-Aware Expert Offloading

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs scale up with mixture-of-experts (MoE) architectures, deploying them on resource-constrained devices remains an open challenge. Existing AR-based offloading methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system for MoE dLLMs. TIDE exploits the temporal stability of expert activations across the diffusion steps within a block and introduces an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate inference scheduling as a mathematical programming problem and solve for the refresh interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. On a single GPU-CPU system, TIDE achieves up to 1.4× and 1.5× throughput improvements over prior baselines on the LLaDA2.0-mini and LLaDA2.0-flash models, respectively.
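
To make the interval-based refresh idea concrete, the following is a minimal toy sketch in Python. It is not the TIDE implementation or its mathematical program: the cost model (a fixed `io_cost` per expert loaded to the GPU and a fixed `cpu_cost` per active expert that has to run on the CPU), the brute-force search, and all names (`refresh_cost`, `best_interval`, `trace`, `capacity`) are assumptions made purely for illustration.

```python
from typing import List, Set


def refresh_cost(trace: List[Set[int]], interval: int, capacity: int,
                 io_cost: float, cpu_cost: float) -> float:
    """Toy total cost of one block when experts are refreshed every `interval` steps.

    `trace[t]` is the (hypothetical) set of expert ids activated at diffusion step t.
    """
    resident: Set[int] = set()  # experts currently held in GPU memory
    total = 0.0
    for step, active in enumerate(trace):
        if step % interval == 0:
            # Refresh: make the GPU set match the currently active experts
            # (at most `capacity` of them) and pay I/O for each new load.
            wanted = set(sorted(active)[:capacity])
            total += io_cost * len(wanted - resident)
            resident = wanted
        # Active experts that are not GPU-resident are computed on the CPU.
        total += cpu_cost * len(active - resident)
    return total


def best_interval(trace: List[Set[int]], capacity: int,
                  io_cost: float, cpu_cost: float) -> int:
    """Brute-force stand-in for the optimization: try every interval, keep the cheapest."""
    candidates = range(1, len(trace) + 1)
    return min(candidates,
               key=lambda k: refresh_cost(trace, k, capacity, io_cost, cpu_cost))


if __name__ == "__main__":
    # Expert activations stay stable for three steps, then shift at the last step.
    trace = [{0, 1}, {0, 1}, {0, 1}, {2, 3}]
    k = best_interval(trace, capacity=2, io_cost=4.0, cpu_cost=1.0)
    print("chosen refresh interval:", k)
```

In this toy trace, refreshing at every step pays I/O for the late activation shift, while a longer interval lets the CPU absorb it more cheaply. This is the kind of I/O-versus-CPU trade-off the interval selection is meant to balance, although the actual formulation and cost terms may differ.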