
Yuan 2.0-M32: Mixture of Experts with Attention Router

May 28, 2024
作者: Shaohua Wu, Jiangang Luo, Xi Chen, Lingjun Li, Xudong Zhao, Tong Yu, Chao Wang, Yue Wang, Fei Wang, Weixu Qiao, Houbo He, Zeru Zhang, Zeyu Sun, Junxiong Mao, Chong Shen
cs.AI

Abstract

Yuan 2.0-M32, with a base architecture similar to Yuan-2.0 2B, uses a mixture-of-experts architecture with 32 experts, of which 2 are active per token. A new router network, the Attention Router, is proposed and adopted for more efficient expert selection, boosting accuracy by 3.8% compared to a model with a classical router network. Yuan 2.0-M32 is trained from scratch on 2000B tokens, and its training computation consumption is only 9.25% of that of a dense model at the same parameter scale. Yuan 2.0-M32 demonstrates competitive capability in coding, math, and various domains of expertise, with only 3.7B active parameters out of 40B in total and 7.4 GFlops of forward computation per token, both of which are only 1/19 of Llama3-70B. Yuan 2.0-M32 surpasses Llama3-70B on the MATH and ARC-Challenge benchmarks, with accuracies of 55.89 and 95.8, respectively. The models and source code of Yuan 2.0-M32 are released on GitHub.
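To make the routing idea concrete, below is a minimal, illustrative PyTorch sketch contrasting a classical linear top-k router with an attention-style router that models correlations among the per-expert scores before selecting the top 2 of 32 experts. The module names, tensor shapes, and the exact attention formulation here are assumptions for illustration, not the released Yuan 2.0-M32 implementation.

```python
# Illustrative sketch only: shapes, names, and the attention step are
# assumptions, not the exact Yuan 2.0-M32 Attention Router.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassicalRouter(nn.Module):
    """Classical MoE router: one linear layer -> top-k -> softmax."""
    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, h):                       # h: (tokens, d_model)
        logits = self.gate(h)                   # (tokens, n_experts)
        vals, idx = logits.topk(self.k, dim=-1)
        return F.softmax(vals, dim=-1), idx     # per-token weights and expert ids

class AttentionRouter(nn.Module):
    """Attention-style router (sketch): instead of scoring each expert
    independently, run a small attention step across the N per-expert
    features so that expert scores can reflect inter-expert correlation."""
    def __init__(self, d_model: int, n_experts: int, d_head: int = 16, k: int = 2):
        super().__init__()
        # Project each token into N per-expert feature vectors.
        self.to_expert_feats = nn.Linear(d_model, n_experts * d_head, bias=False)
        self.wq = nn.Linear(d_head, d_head, bias=False)
        self.wk = nn.Linear(d_head, d_head, bias=False)
        self.wv = nn.Linear(d_head, d_head, bias=False)
        self.n_experts, self.d_head, self.k = n_experts, d_head, k

    def forward(self, h):                       # h: (tokens, d_model)
        t = h.shape[0]
        x = self.to_expert_feats(h).view(t, self.n_experts, self.d_head)
        q, k_, v = self.wq(x), self.wk(x), self.wv(x)
        # Attention over the expert axis: (tokens, n_experts, n_experts).
        attn = F.softmax(q @ k_.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        scores = (attn @ v).sum(dim=-1)         # (tokens, n_experts)
        vals, idx = scores.topk(self.k, dim=-1)
        return F.softmax(vals, dim=-1), idx

# Example: route 4 tokens of width 2048 to 2 of 32 experts.
# h = torch.randn(4, 2048)
# w, idx = AttentionRouter(2048, 32)(h)   # w: (4, 2), idx: (4, 2)
```

As a sanity check on the abstract's numbers: a forward pass costs roughly 2 FLOPs per active parameter per token, so 3.7B active parameters gives about 2 x 3.7B = 7.4 GFlops per token, and 70B / 3.7B = 18.9, consistent with the quoted 1/19 ratio relative to Llama3-70B.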
