CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference
February 6, 2025
Authors: Zehua Pei, Lancheng Zou, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
cs.AI
Abstract
Large language models (LLMs) achieve impressive performance by scaling model
parameters, but this comes with significant inference overhead. Feed-forward
networks (FFNs), which dominate LLM parameters, exhibit high activation
sparsity in hidden neurons. To exploit this, researchers have proposed using a
mixture-of-experts (MoE) architecture, where only a subset of parameters is
activated. However, existing approaches often require extensive training data
and resources, limiting their practicality. We propose CMoE (Carved MoE), a
novel framework to efficiently carve MoE models from dense models. CMoE
achieves remarkable performance through efficient expert grouping and
lightweight adaptation. First, neurons are grouped into shared and routed
experts based on activation rates. Next, we construct a routing mechanism
without training from scratch, incorporating a differentiable routing process
and load balancing. Using modest data, CMoE produces a well-designed, usable
MoE from a 7B dense model within five minutes. With lightweight fine-tuning, it
achieves high-performance recovery in under an hour. We make our code publicly
available at https://github.com/JarvisPei/CMoE.
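
To illustrate the expert-grouping step described above, the sketch below partitions FFN hidden neurons into one always-active shared expert and several routed experts according to their activation rate on calibration data. This is a minimal sketch, not the authors' implementation: the function name `carve_experts`, the shared-expert fraction, and the even split of the remaining neurons are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the CMoE reference code):
# group FFN hidden neurons into a shared expert and routed experts by how
# often each neuron is activated on calibration data.
import torch

def carve_experts(activations: torch.Tensor, num_routed: int, shared_frac: float = 0.125):
    """activations: (num_tokens, d_ffn) hidden activations collected by
    running calibration data through the dense FFN."""
    # Activation rate: fraction of tokens on which each hidden neuron fires.
    rates = (activations > 0).float().mean(dim=0)          # shape: (d_ffn,)

    d_ffn = rates.numel()
    num_shared = int(shared_frac * d_ffn)

    # The most frequently firing neurons form the always-active shared expert.
    shared_idx = torch.topk(rates, num_shared).indices

    # The remaining neurons are split evenly into routed experts.
    mask = torch.ones(d_ffn, dtype=torch.bool)
    mask[shared_idx] = False
    routed_idx = torch.arange(d_ffn)[mask]
    expert_groups = torch.chunk(routed_idx, num_routed)

    return shared_idx, list(expert_groups)
```

The returned index groups could then be used to slice the dense FFN weight matrices into per-expert sub-layers, with a separately constructed router deciding which routed experts each token activates.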