Yuan 2.0-M32: Mixture of Experts with Attention Router
May 28, 2024
Authors: Shaohua Wu, Jiangang Luo, Xi Chen, Lingjun Li, Xudong Zhao, Tong Yu, Chao Wang, Yue Wang, Fei Wang, Weixu Qiao, Houbo He, Zeru Zhang, Zeyu Sun, Junxiong Mao, Chong Shen
cs.AI
Abstract
Yuan 2.0-M32, with a base architecture similar to Yuan-2.0 2B, uses a
mixture-of-experts architecture with 32 experts, of which 2 are active.
A new router network, the Attention Router, is proposed and adopted for
more efficient expert selection, boosting accuracy by 3.8% over a model
with a classical router network. Yuan 2.0-M32 is trained from scratch on
2000B tokens, and its training computation consumption is only 9.25% of
that of a dense model at the same parameter scale. Yuan 2.0-M32
demonstrates competitive capability in coding, math, and various domains
of expertise, with only 3.7B active parameters out of 40B in total and
7.4 GFlops of forward computation per token, both of which are only 1/19
of Llama3-70B's. Yuan 2.0-M32 surpasses Llama3-70B on the MATH and
ARC-Challenge benchmarks, with accuracies of 55.89 and 95.8,
respectively. The models and source code of Yuan 2.0-M32 are released on
GitHub.
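
The abstract does not spell out the Attention Router's equations. Below is a minimal PyTorch sketch of one plausible reading, assuming the router projects each token's hidden state into per-expert query/key/value vectors and applies attention across the expert axis before top-2 selection, so that an expert's score can depend on the other experts rather than being computed independently as in a classical linear router. All class and parameter names (AttentionRouter, router_dim, etc.) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class AttentionRouter(nn.Module):
    """Illustrative top-2 MoE router that models inter-expert
    correlation with an attention step, instead of scoring each
    expert independently as a classical linear router does.
    Hyperparameter names and shapes are assumptions, not from
    the Yuan 2.0-M32 paper."""

    def __init__(self, hidden_dim: int, num_experts: int = 32,
                 router_dim: int = 16, top_k: int = 2):
        super().__init__()
        # Three projections play the roles of Q, K, V: each maps the
        # token state to one small vector per expert.
        self.wq = nn.Linear(hidden_dim, num_experts * router_dim, bias=False)
        self.wk = nn.Linear(hidden_dim, num_experts * router_dim, bias=False)
        self.wv = nn.Linear(hidden_dim, num_experts * router_dim, bias=False)
        self.num_experts = num_experts
        self.router_dim = router_dim
        self.top_k = top_k

    def forward(self, h: torch.Tensor):
        # h: (batch, hidden_dim) token hidden states
        b = h.size(0)
        shape = (b, self.num_experts, self.router_dim)
        q = self.wq(h).view(shape)
        k = self.wk(h).view(shape)
        v = self.wv(h).view(shape)
        # Scaled dot-product attention over the expert axis, so each
        # expert's score is informed by the other experts' vectors.
        attn = torch.softmax(
            q @ k.transpose(-2, -1) / self.router_dim ** 0.5, dim=-1)
        scores = (attn @ v).mean(dim=-1)        # (batch, num_experts)
        probs = torch.softmax(scores, dim=-1)
        weights, experts = probs.topk(self.top_k, dim=-1)
        # Renormalize the selected gates so the two weights sum to 1.
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return weights, experts  # gate values and chosen expert ids
```

A call such as `AttentionRouter(hidden_dim=2048)(torch.randn(4, 2048))` returns, per token, the two gate weights and the two expert indices whose outputs would be mixed.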
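
As a note on the compute figure, 7.4 GFlops per token is consistent with the common estimate of roughly 2 FLOPs per active parameter for a dense forward pass: 2 FLOPs/parameter x 3.7B active parameters = 7.4 GFlops per token.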