CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference
February 6, 2025
Authors: Zehua Pei, Lancheng Zou, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
cs.AI
Abstract
Large language models (LLMs) achieve impressive performance by scaling model
parameters, but this comes with significant inference overhead. Feed-forward
networks (FFNs), which dominate LLM parameters, exhibit high activation
sparsity in hidden neurons. To exploit this, researchers have proposed using a
mixture-of-experts (MoE) architecture, where only a subset of parameters is
activated. However, existing approaches often require extensive training data
and resources, limiting their practicality. We propose CMoE (Carved MoE), a
novel framework to efficiently carve MoE models from dense models. CMoE
achieves remarkable performance through efficient expert grouping and
lightweight adaptation. First, neurons are grouped into shared and routed
experts based on activation rates. Next, we construct a routing mechanism
without training from scratch, incorporating a differentiable routing process
and load balancing. Using modest data, CMoE produces a well-designed, usable
MoE from a 7B dense model within five minutes. With lightweight fine-tuning, it
achieves high-performance recovery in under an hour. We make our code publicly
available at https://github.com/JarvisPei/CMoE.
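
To illustrate the expert-grouping step described above, the sketch below partitions FFN hidden neurons into one always-active shared expert and several routed experts according to their activation rate on calibration data. This is a minimal sketch, not the authors' implementation: the function name `carve_experts`, the shared-expert fraction, and the even split of the remaining neurons are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the CMoE reference code):
# group FFN hidden neurons into a shared expert and routed experts by how
# often each neuron is activated on calibration data.
import torch

def carve_experts(activations: torch.Tensor, num_routed: int, shared_frac: float = 0.125):
    """activations: (num_tokens, d_ffn) hidden activations collected by
    running calibration data through the dense FFN."""
    # Activation rate: fraction of tokens on which each hidden neuron fires.
    rates = (activations > 0).float().mean(dim=0)          # shape: (d_ffn,)

    d_ffn = rates.numel()
    num_shared = int(shared_frac * d_ffn)

    # The most frequently firing neurons form the always-active shared expert.
    shared_idx = torch.topk(rates, num_shared).indices

    # The remaining neurons are split evenly into routed experts.
    mask = torch.ones(d_ffn, dtype=torch.bool)
    mask[shared_idx] = False
    routed_idx = torch.arange(d_ffn)[mask]
    expert_groups = torch.chunk(routed_idx, num_routed)

    return shared_idx, list(expert_groups)
```

The returned index groups could then be used to slice the dense FFN weight matrices into per-expert sub-layers, with a separately constructed router deciding which routed experts each token activates.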