JetMoE: Reaching Llama2 Performance with 0.1M Dollars
April 11, 2024
作者: Yikang Shen, Zhen Guo, Tianle Cai, Zengyi Qin
cs.AI
Abstract
Large Language Models (LLMs) have achieved remarkable results, but their
increasing resource demand has become a major obstacle to the development of
powerful and accessible super-human intelligence. This report introduces
JetMoE-8B, a new LLM trained with less than $0.1 million, using 1.25T tokens
from carefully mixed open-source corpora and 30,000 H100 GPU hours. Despite its
low cost, JetMoE-8B demonstrates impressive performance, with JetMoE-8B
outperforming the Llama2-7B model and JetMoE-8B-Chat surpassing the
Llama2-13B-Chat model. These results suggest that LLM training can be much more
cost-effective than generally thought. JetMoE-8B is based on an efficient
Sparsely-gated Mixture-of-Experts (SMoE) architecture, composed of attention
and feedforward experts. Both layers are sparsely activated, allowing JetMoE-8B
to have 8B parameters while only activating 2B for each input token, reducing
inference computation by about 70% compared to Llama2-7B. Moreover, JetMoE-8B
is highly open and academia-friendly, using only public datasets and training
code. All training parameters and data mixtures have been detailed in this
report to facilitate future efforts in the development of open foundation
models. This transparency aims to encourage collaboration and further
advancements in the field of accessible and efficient LLMs. The model weights
are publicly available at https://github.com/myshell-ai/JetMoE.
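To make the architecture description concrete, below is a minimal sketch of a sparsely-gated Mixture-of-Experts (SMoE) feedforward layer with top-k routing, the general mechanism the abstract refers to. It is not JetMoE's implementation (which, per the abstract, also applies sparse gating to attention experts); the model dimensions, number of experts, and top_k value are illustrative assumptions.

```python
# A minimal, illustrative sketch of a sparsely-gated MoE feedforward layer with
# top-k routing. Not JetMoE's actual code; all hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router scores every expert for each token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an independent two-layer feedforward block.
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_hidden),
                    nn.GELU(),
                    nn.Linear(d_hidden, d_model),
                )
                for _ in range(num_experts)
            ]
        )

    def forward(self, x):
        # x: (num_tokens, d_model)
        gate_logits = self.router(x)                               # (T, E)
        weights, expert_ids = torch.topk(gate_logits, self.top_k)  # (T, k)
        weights = F.softmax(weights, dim=-1)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        # Only the k selected experts run per token, so compute scales with
        # active parameters rather than total parameters.
        for slot in range(self.top_k):
            for eid, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == eid
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    tokens = torch.randn(4, 512)
    layer = SparseMoEFeedForward()
    print(layer(tokens).shape)  # torch.Size([4, 512])
```

Because only top_k of num_experts experts run per token, the active expert parameters are roughly a top_k/num_experts fraction of the total, which is how a model can hold 8B parameters while activating only about 2B per input token.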