TIDE: Efficient and Lossless MoE Diffusion LLM Inference via I/O-Aware Expert Offloading

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context modeling through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system that leverages the temporal stability of expert activations during the intra-block diffusion process. Specifically, building on this stability, we introduce an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate the inference scheduling as a mathematical programming problem and solve for the optimal interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. In a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4× and 1.5× throughput improvements over prior baselines on the LLaDA2.0-mini and LLaDA2.0-flash models, respectively.
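To make the interval-based refresh idea concrete, the following is a minimal toy sketch, not the TIDE implementation. It assumes a hypothetical setup in which each denoising step of a block activates a known set of expert IDs, the GPU can hold `gpu_slots` experts, and experts that are activated but not GPU-resident are computed on the CPU (so the output is unchanged, only the cost differs). The names `simulate` and `best_interval`, the linear cost weights `io_cost` and `cpu_cost`, and the brute-force search over candidate intervals are all illustrative assumptions standing in for the mathematical program mentioned above.

```python
from typing import Sequence, Set, Tuple


def simulate(activations: Sequence[Set[int]], interval: int,
             gpu_slots: int) -> Tuple[int, int]:
    """Toy model: return (expert_loads, cpu_expert_calls) for one block.

    activations[t] is the (hypothetical) set of expert IDs used at step t.
    Every `interval` steps the GPU-resident set is refreshed to match the
    experts active at that step; between refreshes it stays fixed.
    """
    if interval < 1:
        raise ValueError("interval must be >= 1")
    resident: Set[int] = set()
    loads = 0
    cpu_calls = 0
    for step, active in enumerate(activations):
        if step % interval == 0:
            # Prefer experts that are already resident to avoid transfers,
            # then fill the remaining slots in ID order (deterministic).
            wanted = sorted(active, key=lambda e: (e not in resident, e))
            target = set(wanted[:gpu_slots])
            loads += len(target - resident)  # host-to-GPU transfers
            resident = target
        # Active experts that are not on the GPU run on the CPU instead.
        cpu_calls += len(active - resident)
    return loads, cpu_calls


def best_interval(activations: Sequence[Set[int]], gpu_slots: int,
                  candidates: Sequence[int],
                  io_cost: float, cpu_cost: float) -> int:
    """Pick the candidate interval with the lowest assumed linear cost."""
    def cost(k: int) -> float:
        loads, cpu_calls = simulate(activations, k, gpu_slots)
        return io_cost * loads + cpu_cost * cpu_calls
    # Ties resolve to the earliest candidate in the sequence.
    return min(candidates, key=cost)


# Hypothetical trace: activations drift slowly across four denoising steps.
trace = [{0, 1}, {0, 1}, {0, 2}, {0, 2}]
print(simulate(trace, interval=1, gpu_slots=2))
print(simulate(trace, interval=4, gpu_slots=2))
print(best_interval(trace, 2, [1, 2, 3, 4], io_cost=4.0, cpu_cost=1.0))
```

In this toy model, a longer interval trades fewer expert transfers for more CPU-side expert computation, so the preferred interval depends on the relative weights assigned to I/O and CPU work. When activations are stable across steps, a longer interval costs little extra CPU work, which is the intuition the refresh strategy relies on.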