

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

May 29, 2024
Authors: Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, Esther Cheng, Jie Liu, Qunshu Lin, Raven Yuan, Tuney Zheng, Wei Pang, Xinrun Du, Yiming Liang, Yinghao Ma, Yizhi Li, Ziyang Ma, Bill Lin, Emmanouil Benetos, Huan Yang, Junting Zhou, Kaijing Ma, Minghao Liu, Morry Niu, Noah Wang, Quehry Que, Ruibo Liu, Sine Liu, Shawn Guo, Soren Gao, Wangchunshu Zhou, Xinyue Zhang, Yizhi Zhou, Yubo Wang, Yuelin Bai, Yuhan Zhang, Yuxiang Zhang, Zenith Wang, Zhenzhu Yang, Zijian Zhao, Jiajun Zhang, Wanli Ouyang, Wenhao Huang, Wenhu Chen
cs.AI

Abstract

Large Language Models (LLMs) have made great strides in recent years, achieving unprecedented performance across different tasks. However, due to commercial interests, the most competitive models such as GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosure of their training details. Recently, many institutions have open-sourced several strong LLMs, such as LLaMA-3, that are comparable to existing closed-source LLMs. However, only the model weights are provided, with most details (e.g., intermediate checkpoints, pre-training corpus, and training code) left undisclosed. To improve the transparency of LLMs, the research community has begun to release truly open LLMs (e.g., Pythia, Amber, OLMo), for which more details (e.g., the pre-training corpus and training code) are provided. These models have greatly advanced the scientific study of large models, including their strengths, weaknesses, biases, and risks. However, we observe that the existing truly open LLMs are still inferior to state-of-the-art LLMs of similar model size on reasoning, knowledge, and coding tasks. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens. Our MAP-Neo is the first fully open-sourced bilingual LLM whose performance is comparable to existing state-of-the-art LLMs. Moreover, we open-source all details needed to reproduce MAP-Neo, including the cleaned pre-training corpus, the data cleaning pipeline, intermediate checkpoints, and a well-optimized training/evaluation framework. Finally, we hope MAP-Neo will strengthen the open research community and inspire more innovation and creativity, facilitating further improvements of LLMs.
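
Since the abstract notes that the model weights and checkpoints are publicly released, a minimal sketch of loading them with the Hugging Face transformers library is shown below. The repository id "m-a-p/neo_7b" is an assumption based on the paper's naming; substitute the id actually published by the authors.

```python
# Minimal sketch (assumption: weights are hosted on the Hugging Face Hub
# under an id like "m-a-p/neo_7b"; replace with the authors' actual repo id).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "m-a-p/neo_7b"  # assumed repository id, not confirmed by the abstract

# Load the tokenizer and model weights released by the authors.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# The model is bilingual, so English or Chinese prompts should both work.
prompt = "Explain why fully open-sourcing a language model aids research."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```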
