TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload
May 19, 2026
Authors: Zhiben Chen, Youpeng Zhao, Yang Sui, Jun Wang, Yuzhang Shang
cs.AI
Abstract
Diffusion large language models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs scale up with mixture-of-experts (MoE) architectures, deploying them on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system that leverages the temporal stability of expert activations across the diffusion steps within a block. Specifically, TIDE introduces an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate inference scheduling as a mathematical programming problem and solve for the optimal refresh interval, the one that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. On a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4× and 1.5× throughput improvements over prior baselines on the LLaDA2.0-mini and LLaDA2.0-flash models, respectively.
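The abstract does not spell out TIDE's scheduler or its cost model, so the following Python sketch is only a rough illustration of the general idea of interval-based expert refresh. It assumes a toy setting: the GPU holds a fixed number of experts, the resident set is refreshed every `interval` diffusion steps, and any activated expert that is not resident is still executed on the CPU (which is why the outputs stay unchanged). The names `simulate_interval`, `best_interval`, `gpu_slots`, `io_cost`, and `cpu_cost`, along with the linear cost, are hypothetical, and the brute-force search merely stands in for the mathematical program the authors describe.

```python
from typing import List, Optional, Set, Tuple


def simulate_interval(
    activations: List[Set[int]],  # activations[t]: expert ids used at diffusion step t
    interval: int,                # refresh the GPU-resident set every `interval` steps
    gpu_slots: int,               # number of experts that fit on the GPU (assumed)
) -> Tuple[int, int]:
    """Return (experts_transferred, cpu_expert_calls) for one block."""
    resident: Set[int] = set()
    transferred = 0
    cpu_calls = 0
    for step, active in enumerate(activations):
        if step % interval == 0:
            # Refresh: keep active experts that are already resident,
            # then fill the remaining slots with other active experts.
            keep = sorted(active & resident)[:gpu_slots]
            fill = sorted(active - resident)[: gpu_slots - len(keep)]
            transferred += len(fill)
            resident = set(keep) | set(fill)
        # Non-resident experts are computed on the CPU instead of being skipped.
        cpu_calls += len(active - resident)
    return transferred, cpu_calls


def best_interval(
    activations: List[Set[int]],
    gpu_slots: int,
    io_cost: float,    # assumed cost of moving one expert to the GPU
    cpu_cost: float,   # assumed cost of running one expert on the CPU
    max_interval: int,
) -> Optional[Tuple[int, float]]:
    """Brute-force the interval with the lowest toy cost; ties keep the smaller interval."""
    best: Optional[Tuple[int, float]] = None
    for k in range(1, max_interval + 1):
        moved, cpu = simulate_interval(activations, k, gpu_slots)
        cost = io_cost * moved + cpu_cost * cpu
        if best is None or cost < best[1]:
            best = (k, cost)
    return best


# Toy trace: expert 0 is stable across steps, while expert 1 gives way to expert 2.
trace = [{0, 1}, {0, 1}, {0, 2}, {0, 2}]
print(best_interval(trace, gpu_slots=2, io_cost=1.0, cpu_cost=0.25, max_interval=4))
# -> (4, 2.5)
```

In this toy trace, cheap CPU execution favors a long interval (fewer transfers), while raising `cpu_cost` to 2.0 moves the optimum back to refreshing every step. This reflects the I/O versus CPU-compute trade-off that the abstract says the optimal interval balances.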