JetMoE: Reaching Llama2 Performance with 0.1M Dollars
April 11, 2024
Authors: Yikang Shen, Zhen Guo, Tianle Cai, Zengyi Qin
cs.AI
Abstract
Large Language Models (LLMs) have achieved remarkable results, but their
increasing resource demand has become a major obstacle to the development of
powerful and accessible super-human intelligence. This report introduces
JetMoE-8B, a new LLM trained with less than $0.1 million, using 1.25T tokens
from carefully mixed open-source corpora and 30,000 H100 GPU hours. Despite its
low cost, JetMoE-8B demonstrates impressive performance, with JetMoE-8B
outperforming the Llama2-7B model and JetMoE-8B-Chat surpassing the
Llama2-13B-Chat model. These results suggest that LLM training can be much more
cost-effective than generally thought. JetMoE-8B is based on an efficient
Sparsely-gated Mixture-of-Experts (SMoE) architecture, composed of attention
and feedforward experts. Both layers are sparsely activated, allowing JetMoE-8B
to have 8B parameters while only activating 2B for each input token, reducing
inference computation by about 70% compared to Llama2-7B. Moreover, JetMoE-8B
is highly open and academia-friendly, using only public datasets and training
code. All training parameters and data mixtures have been detailed in this
report to facilitate future efforts in the development of open foundation
models. This transparency aims to encourage collaboration and further
advancements in the field of accessible and efficient LLMs. The model weights
are publicly available at https://github.com/myshell-ai/JetMoE.
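To make the sparsely-gated Mixture-of-Experts idea concrete, below is a minimal PyTorch sketch of a top-k routed feedforward expert layer. It illustrates the general SMoE mechanism the abstract describes, not the JetMoE-8B implementation: the class name, hyperparameters (hidden_dim, ffn_dim, num_experts, top_k), and plain GELU experts are illustrative assumptions, and the sparsely activated attention experts mentioned above are omitted.

```python
# Minimal SMoE feedforward sketch with top-k routing.
# Hyperparameter values are assumptions for illustration,
# not the actual JetMoE-8B configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFeedForward(nn.Module):
    def __init__(self, hidden_dim=1024, ffn_dim=2816, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # Router produces one logit per expert for each token.
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        # Each expert is an independent two-layer feedforward network.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (batch, seq_len, hidden_dim) -> flatten tokens for routing.
        batch, seq_len, hidden_dim = x.shape
        tokens = x.reshape(-1, hidden_dim)

        # Select the top-k experts per token; only those experts run,
        # so per-token compute scales with top_k rather than num_experts.
        logits = self.router(tokens)                       # (tokens, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # normalize over chosen experts

        out = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            # Positions (token, slot) routed to this expert.
            mask = indices == expert_id                    # (tokens, top_k)
            token_ids, slot_ids = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            expert_out = expert(tokens[token_ids])
            out.index_add_(
                0, token_ids,
                expert_out * weights[token_ids, slot_ids].unsqueeze(-1),
            )

        return out.reshape(batch, seq_len, hidden_dim)


if __name__ == "__main__":
    layer = SparseMoEFeedForward()
    y = layer(torch.randn(2, 16, 1024))
    print(y.shape)  # torch.Size([2, 16, 1024])
```

Because only top_k of num_experts experts run for each token, the per-token compute is roughly top_k/num_experts of a dense layer with the same total parameter count; this is the general mechanism by which a model can hold 8B parameters while activating only about 2B per input token.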