Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
February 26, 2025
Authors: Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki
cs.AI
Abstract
The Mixture of Experts (MoE) architecture reduces the training and inference
cost significantly compared to a dense model of equivalent capacity. Upcycling
is an approach that initializes and trains an MoE model using a pre-trained
dense model. While upcycling leads to initial performance gains, the training
progresses slower than when trained from scratch, leading to suboptimal
performance in the long term. We propose Drop-Upcycling, a method that
effectively addresses this problem. Drop-Upcycling combines two seemingly
contradictory approaches: utilizing the knowledge of pre-trained dense models
while statistically re-initializing some parts of the weights. This approach
strategically promotes expert specialization, significantly enhancing the MoE
model's efficiency in knowledge acquisition. Extensive large-scale experiments
demonstrate that Drop-Upcycling significantly outperforms previous MoE
construction methods in the long term, specifically when training on hundreds
of billions of tokens or more. As a result, our MoE model with 5.9B active
parameters achieves comparable performance to a 13B dense model in the same
model family, while requiring approximately 1/4 of the training FLOPs. All
experimental resources, including source code, training data, model checkpoints
and logs, are publicly available to promote reproducibility and future research
on MoE.
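To make the method concrete, below is a minimal PyTorch sketch of partial re-initialization during upcycling, assuming a SwiGLU-style dense FFN with weight matrices w_gate, w_up (shape d_ff x d_model) and w_down (shape d_model x d_ff). The function names, the default re-initialization ratio of 0.5, and the choice to sample replacements from a normal distribution matched to each matrix's mean and standard deviation are illustrative assumptions based on the abstract's description of "statistically re-initializing some parts of the weights", not the authors' released implementation.

import torch

def drop_upcycle_expert(w_gate, w_up, w_down, ratio, generator=None):
    """Build one MoE expert from a dense SwiGLU FFN: copy the dense weights,
    then re-initialize a random fraction of the intermediate dimensions from a
    normal distribution matching the original weight statistics (assumption)."""
    d_ff = w_gate.shape[0]                      # intermediate (hidden) size
    n_drop = int(ratio * d_ff)
    # Each expert draws its own subset of dimensions to re-initialize.
    idx = torch.randperm(d_ff, generator=generator)[:n_drop]

    def reinit(w_old, dim):
        w_new = w_old.clone()
        mean, std = w_old.mean(), w_old.std()   # "statistical" re-initialization
        if dim == 0:                            # selected rows of w_gate / w_up
            noise = torch.randn(n_drop, w_old.shape[1], generator=generator)
        else:                                   # matching columns of w_down
            noise = torch.randn(w_old.shape[0], n_drop, generator=generator)
        w_new.index_copy_(dim, idx, noise * std + mean)
        return w_new

    return reinit(w_gate, 0), reinit(w_up, 0), reinit(w_down, 1)

def build_moe_experts(dense_ffn, num_experts, ratio=0.5):
    """Upcycle one dense FFN into num_experts experts; each expert's distinct
    re-initialized subset nudges the experts toward specialization during
    continued training."""
    return [drop_upcycle_expert(*dense_ffn, ratio) for _ in range(num_experts)]

# Usage sketch: a toy FFN with d_model=16, d_ff=64, upcycled into 8 experts
# with half of the intermediate dimensions re-initialized per expert.
if __name__ == "__main__":
    d_model, d_ff = 16, 64
    dense = (torch.randn(d_ff, d_model), torch.randn(d_ff, d_model),
             torch.randn(d_model, d_ff))
    experts = build_moe_experts(dense, num_experts=8, ratio=0.5)
    print(len(experts), experts[0][0].shape)

The untouched dimensions retain the dense model's knowledge, while the re-initialized dimensions give each expert room to diverge, which is the intuition the abstract attributes to the method's improved expert specialization.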