SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

December 16, 2025
Authors: Wentao Guo, Mayank Mishra, Xinle Cheng, Ion Stoica, Tri Dao
cs.AI

Abstract

Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. Recent MoE models demonstrate a clear trend towards high expert granularity (smaller expert intermediate dimension) and higher sparsity (a constant number of activated experts with a higher number of total experts), which improves model quality per FLOP. However, fine-grained MoEs suffer from an increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computation due to padding in Grouped GEMM kernels. In response, we propose a memory-efficient algorithm that computes the forward and backward passes of MoEs with minimal activation caching for the backward pass. We also design GPU kernels that overlap memory IO with computation, benefiting all MoE architectures. Finally, we propose a novel "token rounding" method that minimizes the compute wasted on padding in Grouped GEMM kernels. As a result, our method SonicMoE reduces activation memory by 45% and achieves a 1.86x compute throughput improvement on Hopper GPUs compared to ScatterMoE's BF16 MoE kernel for a fine-grained 7B MoE. Concretely, SonicMoE on 64 H100s achieves a training throughput of 213 billion tokens per day, comparable to ScatterMoE's 225 billion tokens per day on 96 H100s, when training a 7B MoE model with FSDP-2 using the lm-engine codebase. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.16x speedup in kernel execution time compared to vanilla top-K routing while maintaining similar downstream performance. We open-source all our kernels to enable faster MoE model training.
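To make the Grouped GEMM padding issue concrete, the sketch below estimates how many wasted rows are computed when each expert's token block is padded up to the kernel's tile height, and contrasts plain top-K routing with a toy rounding heuristic. The 128-row tile height, the `padded_rows` and `round_tokens_to_tiles` helpers, and the rounding rule itself are illustrative assumptions; the abstract does not spell out SonicMoE's actual tile-aware token rounding algorithm, which additionally has to preserve routing quality.

```python
import numpy as np

TILE_M = 128  # assumed Grouped GEMM tile height (token rows per tile); actual kernel tiles may differ


def padded_rows(tokens_per_expert, tile_m=TILE_M):
    """Rows a Grouped GEMM actually computes when every expert's token block
    is padded up to the next tile boundary."""
    return sum(-(-t // tile_m) * tile_m for t in tokens_per_expert if t > 0)


def round_tokens_to_tiles(tokens_per_expert, tile_m=TILE_M, max_drop_frac=0.1):
    """Toy 'token rounding': shave an expert's count down to the previous tile
    boundary when the overflow is a small fraction of its tokens.
    Illustrative only -- in a real system the shaved tokens would be handled
    by the router rather than silently dropped."""
    rounded = []
    for t in tokens_per_expert:
        down = (t // tile_m) * tile_m
        overflow = t - down
        if down > 0 and overflow <= max_drop_frac * t:
            rounded.append(down)   # overflow is small: round down to the tile boundary
        else:
            rounded.append(t)      # overflow too large to drop: keep the count as-is
    return rounded


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # e.g. 8192 routed token-expert pairs spread over 64 fine-grained experts
    tokens = rng.multinomial(8192, np.ones(64) / 64)
    before = padded_rows(tokens)
    after = padded_rows(round_tokens_to_tiles(tokens))
    print(f"useful rows: {tokens.sum()}")
    print(f"padded rows, vanilla top-K : {before} (waste {before - tokens.sum()})")
    print(f"padded rows, toy rounding  : {after} (waste {after - tokens.sum()})")
```

Running the script shows the padding overhead growing as the per-expert token counts shrink relative to the tile height, which is exactly the fine-grained, high-sparsity regime that tile-aware token rounding targets.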
PDF · December 19, 2025