Skywork: A More Open Bilingual Foundation Model

October 30, 2023
Authors: Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui Hu, Chenxia Li, Liu Yang, Xilin Luo, Xuejie Wu, Lunan Liu, Wenjun Cheng, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Lei Lin, Xiaokun Wang, Yutuan Ma, Chuanhai Dong, Yanqi Sun, Yifu Chen, Yongyi Peng, Xiaojuan Liang, Shuicheng Yan, Han Fang, Yahui Zhou
cs.AI

Abstract

In this technical report, we present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts. This bilingual foundation model is the most extensively trained and openly published LLM of comparable size to date. We introduce a two-stage training methodology using a segmented corpus, targeting general-purpose training first and domain-specific enhancement training second. We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling across diverse domains. Furthermore, we propose a novel leakage detection method, demonstrating that test data contamination is a pressing issue warranting further investigation by the LLM community. To spur future research, we release Skywork-13B along with checkpoints obtained during intermediate stages of the training process. We are also releasing part of our SkyPile corpus, a collection of over 150 billion tokens of web text, the largest high-quality open Chinese pre-training corpus to date. We hope Skywork-13B and our open corpus will serve as a valuable open-source resource to democratize access to high-quality LLMs.
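The leakage detection idea mentioned in the abstract rests on comparing a model's loss on benchmark test data against its loss on comparable data the model could not have seen during pre-training. Below is a minimal sketch of one way such a perplexity-gap check could look, assuming a Hugging Face causal LM; the checkpoint name, the stand-in texts, and the 0.1-nat threshold are illustrative assumptions, not the paper's exact procedure.

```python
# A minimal sketch of a perplexity-gap contamination check, assuming a
# Hugging Face causal LM. Checkpoint name, sample texts, and the threshold
# are illustrative placeholders, not the paper's exact method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_token_loss(model, tokenizer, texts, device="cuda"):
    """Average per-token cross-entropy of `model` over `texts`."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt", truncation=True,
                            max_length=2048).input_ids.to(device)
            # labels=input_ids makes the model return its own LM loss,
            # already averaged over the shifted target tokens
            loss = model(ids, labels=ids).loss
            n = ids.shape[1] - 1  # number of predicted tokens
            total_loss += loss.item() * n
            total_tokens += n
    return total_loss / total_tokens

# Illustrative checkpoint and stand-in texts; real test/reference sets
# would hold benchmark items and freshly written items of the same style.
model_name = "Skywork/Skywork-13B-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             trust_remote_code=True).to("cuda")

test_texts = ["Q: A farmer has 12 sheep ... A: 7"]   # benchmark test items
ref_texts = ["Q: A baker sells 15 loaves ... A: 9"]  # unseen references

test_loss = mean_token_loss(model, tokenizer, test_texts)
ref_loss = mean_token_loss(model, tokenizer, ref_texts)
# A test-split loss well below the reference loss suggests the test data
# leaked into pre-training; 0.1 nats is an arbitrary placeholder threshold.
if ref_loss - test_loss > 0.1:
    print("possible test-set contamination")
```

The design choice worth noting is that both sets are scored by the same model under the same tokenization, so any systematic gap in per-token loss reflects familiarity with the text itself rather than differences in task difficulty, provided the reference items genuinely match the test items in style and distribution.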