SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations
December 16, 2025
Authors: Wentao Guo, Mayank Mishra, Xinle Cheng, Ion Stoica, Tri Dao
cs.AI
Abstract
Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing computational cost. Recent MoE models show a clear trend towards higher expert granularity (smaller expert intermediate dimension) and higher sparsity (a constant number of activated experts with a larger total number of experts), both of which improve model quality per FLOP. However, fine-grained MoEs suffer from an increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computation due to padding in Grouped GEMM kernels. In response, we propose a memory-efficient algorithm that computes the forward and backward passes of MoEs with minimal activation caching for the backward pass. We also design GPU kernels that overlap memory IO with computation, benefiting all MoE architectures. Finally, we propose a novel "token rounding" method that minimizes the compute wasted on padding in Grouped GEMM kernels. As a result, our method SonicMoE reduces activation memory by 45% and achieves a 1.86x compute throughput improvement on Hopper GPUs compared to ScatterMoE's BF16 MoE kernel for a fine-grained 7B MoE. Concretely, SonicMoE on 64 H100s achieves a training throughput of 213 billion tokens per day, comparable to ScatterMoE's 225 billion tokens per day on 96 H100s, when training a 7B MoE model with FSDP-2 using the lm-engine codebase. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.16x speedup in kernel execution time over vanilla top-K routing while maintaining similar downstream performance. We open-source all our kernels to enable faster MoE model training.
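To make the padding overhead concrete, the following minimal sketch illustrates how a Grouped GEMM kernel that rounds each expert's token count up to a full M-tile wastes compute under vanilla top-K routing, and why tile-aligned per-expert counts avoid that waste. The tile size, per-expert token counts, and the tile-aligned assignment below are hypothetical illustrations only, not SonicMoE's actual kernel configuration or its token-rounding rule.

```python
# Illustrative sketch only (hypothetical tile size and token counts, not the
# SonicMoE implementation): padding each expert's token block up to the
# Grouped GEMM M-tile wastes compute; tile-aligned counts keep every tile full.
import math

TILE_M = 128  # hypothetical M-tile of the Grouped GEMM kernel

def padded_rows(n: int, tile: int = TILE_M) -> int:
    """Rows the kernel actually computes for one expert: n rounded up to a full tile."""
    return math.ceil(n / tile) * tile

def padding_waste(counts):
    """Return (padded-but-useless rows, total rows computed) across all experts."""
    computed = sum(padded_rows(n) for n in counts if n > 0)
    useful = sum(counts)
    return computed - useful, computed

# Hypothetical per-expert token counts produced by vanilla top-K routing.
topk_counts = [37, 190, 5, 280]            # 512 routed tokens in total
wasted, computed = padding_waste(topk_counts)
print(f"vanilla top-K: {wasted}/{computed} computed rows are padding "
      f"({100 * wasted / computed:.1f}% of the Grouped GEMM is wasted)")

# If the router instead nudges assignments so each expert's count is a
# multiple of TILE_M (same 512 tokens overall), every launched tile is full.
# The paper's tile-aware token rounding is more involved, since it must also
# preserve routing quality; this only illustrates where the speedup comes from.
tile_aligned_counts = [128, 128, 0, 256]   # still 512 tokens
wasted2, computed2 = padding_waste(tile_aligned_counts)
print(f"tile-aligned counts: {wasted2}/{computed2} computed rows are padding")
```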