

BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

August 15, 2024
作者: Qizhen Zhang, Nikolas Gritsch, Dwaraknath Gnaneshwar, Simon Guo, David Cairuz, Bharat Venkitesh, Jakob Foerster, Phil Blunsom, Sebastian Ruder, Ahmet Ustun, Acyr Locatelli
cs.AI

Abstract

The Mixture of Experts (MoE) framework has become a popular architecture for large language models due to its superior performance over dense models. However, training MoEs from scratch in a large-scale regime is prohibitively expensive. Existing methods mitigate this by pre-training multiple dense expert models independently and using them to initialize an MoE. This is done by using the experts' feed-forward networks (FFNs) to initialize the MoE's experts while merging the other parameters. However, this method limits the reuse of dense model parameters to only the FFN layers, thereby constraining the advantages when "upcycling" these models into MoEs. We propose BAM (Branch-Attend-Mix), a simple yet effective method that addresses this shortcoming. BAM makes full use of specialized dense models by not only using their FFNs to initialize the MoE layers but also fully leveraging the experts' attention parameters by initializing them into a soft variant of Mixture of Attention (MoA) layers. We explore two methods for upcycling attention parameters: 1) initializing separate attention experts from the dense models, including all attention parameters, for the best model performance; and 2) sharing key and value parameters across all experts to facilitate better inference efficiency. To further improve efficiency, we adapt a parallel attention transformer architecture to MoEs, which allows the attention experts and FFN experts to be computed concurrently. Our experiments on seed models ranging from 590 million to 2 billion parameters demonstrate that BAM surpasses baselines in both perplexity and downstream task performance, within the same computational and data constraints.
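
The upcycling recipe in the abstract can be illustrated with a short sketch. Below is a minimal PyTorch example (not the authors' released code) of BAM-style initialization: each specialized dense seed model contributes its FFN weights to an FFN expert and its attention weights to a soft Mixture-of-Attention expert; `share_kv=True` illustrates the second variant, in which key and value projections are shared across attention experts, and the forward pass shows soft routing combined with the parallel attention layout. Module names, toy dimensions, and the single-head attention are simplifying assumptions.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    """Single-head attention with separate q/k/v/o projections (toy version)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.o = nn.Linear(d_model, d_model)

    def forward(self, x):
        q, k, v = self.q(x), self.k(x), self.v(x)
        w = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.o(w @ v)


class DenseBlock(nn.Module):
    """One transformer block of a specialized dense seed model."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.attn = Attention(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))


class BAMBlock(nn.Module):
    """MoE block upcycled from several dense blocks (illustrative sketch)."""
    def __init__(self, dense_blocks, share_kv: bool = False):
        super().__init__()
        d_model = dense_blocks[0].attn.q.in_features
        self.norm = nn.LayerNorm(d_model)
        self.router = nn.Linear(d_model, len(dense_blocks))
        # Upcycling: copy each seed model's attention into an attention expert
        # and its FFN into an FFN expert.
        self.attn_experts = nn.ModuleList(copy.deepcopy(b.attn) for b in dense_blocks)
        self.ffn_experts = nn.ModuleList(copy.deepcopy(b.ffn) for b in dense_blocks)
        if share_kv:
            # Variant 2: share key/value projections across attention experts
            # so the KV cache only needs to be computed once at inference.
            for e in self.attn_experts[1:]:
                e.k, e.v = self.attn_experts[0].k, self.attn_experts[0].v

    def forward(self, x):
        h = self.norm(x)
        # Soft routing: every expert is evaluated and outputs are mixed by
        # router weights (the "soft variant" of Mixture of Attention).
        gates = F.softmax(self.router(h), dim=-1)                     # (B, T, E)
        attn = torch.stack([e(h) for e in self.attn_experts], dim=-2)  # (B, T, E, D)
        ffn = torch.stack([e(h) for e in self.ffn_experts], dim=-2)    # (B, T, E, D)
        attn_out = (gates.unsqueeze(-1) * attn).sum(dim=-2)
        ffn_out = (gates.unsqueeze(-1) * ffn).sum(dim=-2)
        # Parallel attention layout: attention and FFN branches are added to
        # the residual together, so they can be computed concurrently.
        return x + attn_out + ffn_out


# Usage: upcycle three toy dense seed models into one BAM block.
seeds = [DenseBlock() for _ in range(3)]
block = BAMBlock(seeds, share_kv=True)
y = block(torch.randn(2, 8, 64))
print(y.shape)  # torch.Size([2, 8, 64])
```

The sketch evaluates all experts densely for clarity; the paper's efficiency arguments (shared KV projections, parallel attention) are reflected only in structure, not in an optimized implementation.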
