SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

Abstract

Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing computational cost. Recent MoE models show a clear trend towards higher expert granularity (a smaller expert intermediate dimension) and higher sparsity (a constant number of activated experts with a larger number of total experts), both of which improve model quality per FLOP. However, fine-grained MoEs suffer from an increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computation due to padding in Grouped GEMM kernels. In response, we propose a memory-efficient algorithm that computes the forward and backward passes of MoEs with minimal activation caching for the backward pass. We also design GPU kernels that overlap memory IO with computation, which benefits all MoE architectures. Finally, we propose a novel "token rounding" method that minimizes the compute wasted on padding in Grouped GEMM kernels. As a result, our method, SonicMoE, reduces activation memory by 45% and achieves a 1.86x compute throughput improvement on Hopper GPUs compared to ScatterMoE's BF16 MoE kernel for a fine-grained 7B MoE. Concretely, when training a 7B MoE model with FSDP-2 using the lm-engine codebase, SonicMoE on 64 H100s achieves a training throughput of 213 billion tokens per day, comparable to ScatterMoE's 225 billion tokens per day on 96 H100s. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.16x speedup in kernel execution time compared to vanilla top-K routing while maintaining similar downstream performance. We open-source all our kernels to enable faster MoE model training.
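To make the padding issue concrete, the following is a minimal sketch of how padding waste in a tiled Grouped GEMM can be estimated, and how rounding per-expert token counts to tile multiples would remove it. It is an illustration under stated assumptions, not the SonicMoE implementation: the tile size of 128, the per-expert token counts, and the helper names (`padded_rows`, `padding_waste`, `round_to_tile`) are all hypothetical, and round-to-nearest is only one plausible reading of "token rounding" that the abstract does not spell out.

```python
import math


def padded_rows(tokens_per_expert, tile_m):
    """Rows each expert occupies after padding its token count up to a multiple of tile_m."""
    return [math.ceil(n / tile_m) * tile_m for n in tokens_per_expert]


def padding_waste(tokens_per_expert, tile_m):
    """Fraction of computed rows that are padding rather than real tokens."""
    padded = sum(padded_rows(tokens_per_expert, tile_m))
    real = sum(tokens_per_expert)
    return 0.0 if padded == 0 else (padded - real) / padded


def round_to_tile(tokens_per_expert, tile_m):
    """Hypothetical rounding rule: move each count to the nearest multiple of tile_m.

    This is an assumption for illustration; the actual SonicMoE rule may differ.
    """
    return [int(n / tile_m + 0.5) * tile_m for n in tokens_per_expert]


# Made-up routing outcome for four experts (not data from the paper).
counts = [130, 60, 250, 5]
tile_m = 128  # assumed tile size along the token dimension

waste_before = padding_waste(counts, tile_m)
rounded = round_to_tile(counts, tile_m)
waste_after = padding_waste(rounded, tile_m)

print(f"padded rows:   {padded_rows(counts, tile_m)}")
print(f"waste before:  {waste_before:.3f}")
print(f"rounded:       {rounded}")
print(f"waste after:   {waste_after:.3f}")
```

In this sketch, rounding changes how many tokens each expert processes, so it trades routing fidelity for tile-aligned work. That trade-off is presumably why the abstract reports downstream performance alongside the kernel speedup.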
