Skywork: A More Open Bilingual Foundation Model
October 30, 2023
Authors: Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui Hu, Chenxia Li, Liu Yang, Xilin Luo, Xuejie Wu, Lunan Liu, Wenjun Cheng, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Lei Lin, Xiaokun Wang, Yutuan Ma, Chuanhai Dong, Yanqi Sun, Yifu Chen, Yongyi Peng, Xiaojuan Liang, Shuicheng Yan, Han Fang, Yahui Zhou
cs.AI
Abstract
In this technical report, we present Skywork-13B, a family of large language
models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both
English and Chinese texts. This bilingual foundation model is the most
extensively trained and openly published LLM of comparable size to date. We
introduce a two-stage training methodology using a segmented corpus:
general-purpose pre-training followed by domain-specific enhancement
training. We show that our model not only excels on popular benchmarks, but
also achieves state-of-the-art performance in Chinese language modeling
on diverse domains. Furthermore, we propose a novel leakage detection method,
demonstrating that test data contamination is a pressing issue warranting
further investigation by the LLM community. To spur future research, we release
Skywork-13B along with checkpoints obtained during intermediate stages of the
training process. We are also releasing part of our SkyPile corpus, a
collection of over 150 billion tokens of web text, which is the largest
high-quality open Chinese pre-training corpus to date. We hope Skywork-13B and our
open corpus will serve as a valuable open-source resource to democratize access
to high-quality LLMs.
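The leakage-detection idea mentioned above can be illustrated with a minimal sketch: if a model was trained on a benchmark's test split, its loss on official test samples tends to be noticeably lower than its loss on freshly written samples of similar style and difficulty. The function name, the threshold, and all loss values below are hypothetical placeholders, not the paper's actual method or data.

```python
# Hedged sketch of a loss-based contamination check, assuming we already have
# per-sample cross-entropy losses from some language model. All numbers and
# names here are illustrative, not taken from the Skywork report.
from statistics import mean

def contamination_score(test_losses, reference_losses):
    """Ratio of mean loss on official benchmark test samples to mean loss
    on freshly generated reference samples. A ratio well below 1.0 hints
    that the model may have seen the test data during training."""
    return mean(test_losses) / mean(reference_losses)

# Illustrative per-sample losses (nats per token), purely made up.
test_losses = [1.8, 1.7, 1.9, 1.6]       # official benchmark samples
reference_losses = [2.4, 2.5, 2.3, 2.6]  # freshly written look-alikes

score = contamination_score(test_losses, reference_losses)
suspicious = score < 0.9  # hypothetical decision threshold
print(f"score={score:.2f}, suspicious={suspicious}")
```

In practice the reference samples would need to be matched to the benchmark in topic, length, and difficulty, and the comparison would use a statistical test rather than a fixed threshold; this sketch only conveys the core intuition.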