SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

Abstract

Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing computational cost. Recent MoE models show a clear trend towards higher expert granularity (a smaller expert intermediate dimension) and higher sparsity (a constant number of activated experts with a larger number of total experts), both of which improve model quality per FLOP. However, fine-grained MoEs suffer from an increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computation due to padding in Grouped GEMM kernels. In response, we propose a memory-efficient algorithm that computes the forward and backward passes of MoEs with minimal activation caching for the backward pass. We also design GPU kernels that overlap memory IO with computation, which benefits all MoE architectures. Finally, we propose a novel "token rounding" method that minimizes the compute wasted on padding in Grouped GEMM kernels. As a result, our method, SonicMoE, reduces activation memory by 45% and achieves a 1.86x compute throughput improvement on Hopper GPUs compared to ScatterMoE's BF16 MoE kernel for a fine-grained 7B MoE. Concretely, when training a 7B MoE model with FSDP-2 using the lm-engine codebase, SonicMoE on 64 H100s achieves a training throughput of 213 billion tokens per day, comparable to ScatterMoE's 225 billion tokens per day on 96 H100s. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.16x speedup in kernel execution time compared to vanilla top-K routing while maintaining similar downstream performance. We open-source all our kernels to enable faster MoE model training.
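To make the padding problem concrete, the following is a minimal sketch and not the SonicMoE implementation. It assumes a tiled Grouped GEMM that pads each expert's token count up to a multiple of a tile size, and it models "token rounding" simply as rounding each per-expert count to the nearest tile multiple. The function names, the tile size of 128, and the example counts are all hypothetical, and the abstract does not specify how tokens are actually added or dropped.

```python
import math


def padded_rows(counts, tile_m):
    """Rows processed when each expert's token count is padded up to a tile multiple.

    Assumption: the Grouped GEMM works on whole tiles of `tile_m` rows per expert.
    """
    return sum(math.ceil(c / tile_m) * tile_m for c in counts)


def padding_waste(counts, tile_m):
    """Fraction of processed rows that are padding rather than real tokens."""
    padded = padded_rows(counts, tile_m)
    return (padded - sum(counts)) / padded if padded else 0.0


def round_counts_to_tile(counts, tile_m):
    """Hypothetical rounding rule: nearest multiple of `tile_m`, ties rounded up."""
    return [((c + tile_m // 2) // tile_m) * tile_m for c in counts]


# Hypothetical per-expert token counts after top-K routing.
counts = [130, 5, 260, 120]
tile_m = 128

rounded = round_counts_to_tile(counts, tile_m)
print("padded rows before rounding:", padded_rows(counts, tile_m))
print("waste before rounding:", round(padding_waste(counts, tile_m), 3))
print("rounded counts:", rounded)
print("waste after rounding:", padding_waste(rounded, tile_m))
```

In this toy setting, an expert that receives only a handful of tokens still costs a full tile, which is why padding waste grows as routing becomes sparser. Rounding counts to tile multiples removes the padding, but it changes which tokens each expert sees, so any such scheme has to be checked against downstream quality, as the abstract notes.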
