

BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

August 15, 2024
作者: Qizhen Zhang, Nikolas Gritsch, Dwaraknath Gnaneshwar, Simon Guo, David Cairuz, Bharat Venkitesh, Jakob Foerster, Phil Blunsom, Sebastian Ruder, Ahmet Ustun, Acyr Locatelli
cs.AI

Abstract

The Mixture of Experts (MoE) framework has become a popular architecture for large language models due to its superior performance over dense models. However, training MoEs from scratch in a large-scale regime is prohibitively expensive. Existing methods mitigate this by pre-training multiple dense expert models independently and using them to initialize an MoE. This is done by using the experts' feed-forward networks (FFNs) to initialize the MoE's experts while merging the other parameters. However, this method limits the reuse of dense model parameters to only the FFN layers, thereby constraining the advantages when "upcycling" these models into MoEs. We propose BAM (Branch-Attend-Mix), a simple yet effective method that addresses this shortcoming. BAM makes full use of specialized dense models by not only using their FFNs to initialize the MoE layers but also leveraging the experts' attention parameters fully, initializing them into a soft variant of Mixture of Attention (MoA) layers. We explore two methods for upcycling attention parameters: 1) initializing separate attention experts from the dense models, including all attention parameters, for the best model performance; and 2) sharing key and value parameters across all experts to facilitate better inference efficiency. To further improve efficiency, we adapt a parallel attention transformer architecture to MoEs, which allows the attention experts and FFN experts to be computed concurrently. Our experiments on seed models ranging from 590 million to 2 billion parameters demonstrate that BAM surpasses baselines in both perplexity and downstream task performance, within the same computational and data constraints.
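The sketch below illustrates the upcycling idea described in the abstract: copy each specialized dense seed model's attention and FFN parameters into the corresponding attention and FFN experts of an upcycled block, route both expert types softly, and compute attention and FFN from the same normalized input (the parallel attention layout). This is a minimal illustration, not the authors' implementation: the module names (DenseBlock, BAMBlock), the use of nn.MultiheadAttention as the attention stand-in, and soft routing on the FFN side are all assumptions; the paper's shared key/value variant is not shown.

```python
# Minimal sketch of BAM-style parameter upcycling (illustrative, not the paper's code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseBlock(nn.Module):
    """Simplified pre-norm transformer block of one specialized dense seed model."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        h = self.norm(x)
        a, _ = self.attn(h, h, h)
        return x + a + self.ffn(h)  # parallel attention + FFN


class BAMBlock(nn.Module):
    """Upcycled block: attention experts and FFN experts, both soft-routed here.

    Soft routing means every expert is evaluated and the outputs are combined
    with router probabilities (a simplification of the soft-variant MoA; a
    real MoE FFN layer would typically use sparse top-k routing instead).
    """
    def __init__(self, dense_blocks):
        super().__init__()
        ref = dense_blocks[0]
        d_model = ref.norm.normalized_shape[0]
        self.norm = copy.deepcopy(ref.norm)
        # Branch: copy attention and FFN parameters from each dense seed model.
        self.attn_experts = nn.ModuleList(copy.deepcopy(b.attn) for b in dense_blocks)
        self.ffn_experts = nn.ModuleList(copy.deepcopy(b.ffn) for b in dense_blocks)
        self.attn_router = nn.Linear(d_model, len(dense_blocks))
        self.ffn_router = nn.Linear(d_model, len(dense_blocks))

    def forward(self, x):
        h = self.norm(x)
        # Attend: soft mixture over attention experts.
        p_attn = F.softmax(self.attn_router(h), dim=-1)                        # [B, T, E]
        attn_out = torch.stack([a(h, h, h)[0] for a in self.attn_experts], -1)  # [B, T, D, E]
        attn_mix = (attn_out * p_attn.unsqueeze(-2)).sum(-1)
        # Mix: soft mixture over FFN experts, computed from the same input as
        # attention (parallel-attention layout, so both can run concurrently).
        p_ffn = F.softmax(self.ffn_router(h), dim=-1)
        ffn_out = torch.stack([f(h) for f in self.ffn_experts], -1)
        ffn_mix = (ffn_out * p_ffn.unsqueeze(-2)).sum(-1)
        return x + attn_mix + ffn_mix


# Usage: upcycle three specialized dense seed blocks into one BAM block.
seeds = [DenseBlock(d_model=64, n_heads=4, d_ff=256) for _ in range(3)]
bam = BAMBlock(seeds)
y = bam(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```

Because attention and FFN experts both read the same normalized hidden state, the two expert groups have no sequential dependency and can be scheduled concurrently, which is the efficiency argument the abstract makes for the parallel attention layout.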
