Skywork: A More Open Bilingual Foundation Model
October 30, 2023
Authors: Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui Hu, Chenxia Li, Liu Yang, Xilin Luo, Xuejie Wu, Lunan Liu, Wenjun Cheng, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Lei Lin, Xiaokun Wang, Yutuan Ma, Chuanhai Dong, Yanqi Sun, Yifu Chen, Yongyi Peng, Xiaojuan Liang, Shuicheng Yan, Han Fang, Yahui Zhou
cs.AI
Abstract
In this technical report, we present Skywork-13B, a family of large language
models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both
English and Chinese texts. This bilingual foundation model is the most
extensively trained and openly published LLM of comparable size to date. We
introduce a two-stage training methodology using a segmented corpus:
general-purpose pre-training followed by domain-specific enhancement
training. We show that our model not only excels on popular benchmarks, but
also achieves state-of-the-art performance in Chinese language modeling
on diverse domains. Furthermore, we propose a novel leakage detection method,
demonstrating that test data contamination is a pressing issue warranting
further investigation by the LLM community. To spur future research, we release
Skywork-13B along with checkpoints obtained during intermediate stages of the
training process. We are also releasing part of our SkyPile corpus, a
collection of over 150 billion tokens of web text, which is the largest
high-quality open Chinese pre-training corpus to date. We hope Skywork-13B and our
open corpus will serve as a valuable open-source resource to democratize access
to high-quality LLMs.
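The leakage-detection idea mentioned above can be illustrated with a minimal sketch: if a model was trained on a benchmark's test split, its loss on official test samples tends to be noticeably lower than its loss on freshly written samples of similar style and difficulty. The function name, the threshold, and all loss values below are hypothetical placeholders, not the paper's actual method or data.

```python
# Hedged sketch of a loss-based contamination check, assuming we already have
# per-sample cross-entropy losses from some language model. All numbers and
# names here are illustrative, not taken from the Skywork report.
from statistics import mean

def contamination_score(test_losses, reference_losses):
    """Ratio of mean loss on official benchmark test samples to mean loss
    on freshly generated reference samples. A ratio well below 1.0 hints
    that the model may have seen the test data during training."""
    return mean(test_losses) / mean(reference_losses)

# Illustrative per-sample losses (nats per token), purely made up.
test_losses = [1.8, 1.7, 1.9, 1.6]       # official benchmark samples
reference_losses = [2.4, 2.5, 2.3, 2.6]  # freshly written look-alikes

score = contamination_score(test_losses, reference_losses)
suspicious = score < 0.9  # hypothetical decision threshold
print(f"score={score:.2f}, suspicious={suspicious}")
```

In practice the reference samples would need to be matched to the benchmark in topic, length, and difficulty, and the comparison would use a statistical test rather than a fixed threshold; this sketch only conveys the core intuition.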