JetMoE: Reaching Llama2 Performance with 0.1M Dollars

Abstract

Large Language Models (LLMs) have achieved remarkable results, but their increasing resource demand has become a major obstacle to the development of powerful and accessible super-human intelligence. This report introduces JetMoE-8B, a new LLM trained for less than $0.1 million, using 1.25T tokens from carefully mixed open-source corpora and 30,000 H100 GPU hours. Despite its low cost, JetMoE-8B demonstrates impressive performance: JetMoE-8B outperforms Llama2-7B, and JetMoE-8B-Chat surpasses Llama2-13B-Chat. These results suggest that LLM training can be much more cost-effective than generally thought. JetMoE-8B is based on an efficient Sparsely-gated Mixture-of-Experts (SMoE) architecture composed of attention and feedforward experts. Both layers are sparsely activated, allowing JetMoE-8B to have 8B parameters while activating only 2B for each input token, which reduces inference computation by about 70% compared to Llama2-7B. Moreover, JetMoE-8B is highly open and academia-friendly, using only public datasets and training code. All training parameters and data mixtures are detailed in this report to facilitate future efforts in the development of open foundation models. This transparency aims to encourage collaboration and further advancements in the field of accessible and efficient LLMs. The model weights are publicly available at https://github.com/myshell-ai/JetMoE.
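To make the idea of sparse activation concrete, here is a minimal sketch of top-k expert routing in plain Python. It is not JetMoE's implementation: the function names (`top_k_route`, `sparse_moe`), the four toy experts, the router scores, and `k=2` are all assumptions chosen for illustration. It only shows the general principle that a router scores every expert but just the k best-scoring experts are evaluated for a given token.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def sparse_moe(x, experts, router_logits, k):
    """Weighted sum of the outputs of the selected experts only.

    Experts that are not selected are never called, which is where the
    compute saving of a sparse layer comes from.
    """
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


def make_expert(scale):
    """Hypothetical toy expert: scales its input by a constant."""
    return lambda x: [scale * xi for xi in x]


# Four toy experts; in a real model each would be a small neural network.
experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
# Hypothetical router scores for one token (normally produced by a learned layer).
logits = [0.1, 2.0, -1.0, 1.0]

routes = top_k_route(logits, k=2)
output = sparse_moe([1.0, -1.0], experts, logits, k=2)
```

With these toy values only experts 1 and 3 run for the token, so half of the expert parameters stay idle. The same principle, applied at scale, is what lets a model hold many more parameters than it activates per token.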
