TIDE: Efficient and Lossless MoE Diffusion LLM Inference via I/O-Aware Expert Offload

Abstract

Diffusion large language models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs scale up with mixture-of-experts (MoE) architectures, deploying them on resource-constrained devices remains an open challenge. Existing methods designed for AR models often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a resource-efficient inference system that exploits the temporal stability of expert activations during the diffusion process within a block. Specifically, TIDE introduces an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate inference scheduling as a mathematical programming problem and solve for the interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a "free lunch" acceleration for dLLM inference. On a single GPU-CPU system, TIDE achieves up to 1.4× and 1.5× throughput improvements over prior baselines on the LLaDA2.0-mini and LLaDA2.0-flash models, respectively.
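To make the interval-based refresh idea concrete, the following is a minimal toy sketch, not TIDE's actual scheduler. It assumes a hypothetical cost model in which loading one expert onto the GPU costs `io_cost`, and running one activated expert on the CPU (because it is not GPU-resident) costs `cpu_cost`. The function names `simulate_block` and `best_interval`, the slot-filling policy, and the brute-force search over candidate intervals are all illustrative assumptions; the abstract only states that the interval is obtained from a mathematical programming formulation.

```python
def simulate_block(trace, interval, gpu_slots, io_cost=1.0, cpu_cost=0.25):
    """Toy cost of decoding one block with a fixed refresh interval.

    trace     : list of sets; trace[t] holds the expert ids activated at
                denoising step t (hypothetical input, e.g. from profiling).
    interval  : the GPU-resident expert set is refreshed every `interval` steps.
    gpu_slots : how many experts fit on the GPU at once.

    Every activated expert is still executed (on GPU or CPU), so the routing
    and the model output are unchanged; only placement differs.
    """
    resident = set()
    loads = 0      # experts copied from CPU memory to the GPU
    cpu_runs = 0   # activated experts that had to run on the CPU
    for step, active in enumerate(trace):
        if step % interval == 0:
            # Assumed policy: keep the lowest-numbered active experts that fit.
            wanted = set(sorted(active)[:gpu_slots])
            loads += len(wanted - resident)
            resident = wanted
        cpu_runs += len(active - resident)
    return io_cost * loads + cpu_cost * cpu_runs


def best_interval(trace, gpu_slots, candidates, **costs):
    """Brute-force stand-in for the optimization: pick the cheapest interval."""
    return min(candidates,
               key=lambda k: simulate_block(trace, k, gpu_slots, **costs))


if __name__ == "__main__":
    # Activations are stable for two steps, then shift to different experts.
    trace = [{0, 1}, {0, 1}, {2, 3}, {2, 3}]
    for k in (1, 2, 4):
        print(k, simulate_block(trace, k, gpu_slots=2))
    print("best:", best_interval(trace, gpu_slots=2, candidates=[1, 2, 4]))
```

In this toy model a short interval pays more I/O to track every shift in activations, while a long interval pays more CPU computation for experts that are no longer resident. The more stable the activations are across denoising steps, the longer the interval can be without incurring that CPU cost.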