Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model
April 5, 2024
Authors: Xinrun Du, Zhouliang Yu, Songyang Gao, Ding Pan, Yuyang Cheng, Ziyang Ma, Ruibin Yuan, Xingwei Qu, Jiaheng Liu, Tianyu Zheng, Xinchen Luo, Guorui Zhou, Binhang Yuan, Wenhu Chen, Jie Fu, Ge Zhang
cs.AI
Abstract
In this study, we introduce CT-LLM, a 2B large language model (LLM) that
illustrates a pivotal shift towards prioritizing the Chinese language in
developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the
conventional methodology by primarily incorporating Chinese textual data,
utilizing an extensive corpus of 1,200 billion tokens, including 800 billion
Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This
strategic composition facilitates the model's exceptional proficiency in
understanding and processing Chinese, a capability further enhanced through
alignment techniques. Demonstrating remarkable performance on the CHC-Bench,
CT-LLM excels in Chinese language tasks and showcases its adeptness in English
through supervised fine-tuning (SFT). This research challenges the prevailing paradigm of training LLMs
predominantly on English corpora and then adapting them to other languages,
broadening the horizons for LLM training methodologies. By open-sourcing the
full process of training a Chinese LLM, including a detailed data processing
procedure with the obtained Massive Appropriate Pretraining Chinese Corpus
(MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark
(CHC-Bench), and the 2B-size Chinese Tiny LLM (CT-LLM), we aim to foster
further exploration and innovation in both academia and industry, paving the
way for more inclusive and versatile language models.
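
The abstract's one quantitative claim is the corpus composition: 1,200 billion tokens split 800B/300B/100B across Chinese, English, and code. The short sketch below works out the mixture ratios this composition implies, roughly two-thirds Chinese. It is a minimal illustration only: the variable and function names are hypothetical, and proportional sampling is an assumption, not the paper's documented procedure.

```python
# Minimal sketch of the pretraining data mixture reported in the abstract.
# The token counts (800B Chinese, 300B English, 100B code) come from the
# paper; the names and the proportional-sampling assumption are
# illustrative, not the authors' actual pipeline.

CORPUS_TOKENS_B = {
    "zh": 800,    # Chinese text tokens (majority share)
    "en": 300,    # English text tokens
    "code": 100,  # source-code tokens
}

def mixture_weights(corpus):
    """Per-source sampling weights proportional to token counts."""
    total = sum(corpus.values())
    return {src: count / total for src, count in corpus.items()}

if __name__ == "__main__":
    total_b = sum(CORPUS_TOKENS_B.values())
    print(f"total: {total_b}B tokens")  # total: 1200B tokens
    for src, weight in mixture_weights(CORPUS_TOKENS_B).items():
        print(f"{src}: {weight:.1%}")   # zh: 66.7%, en: 25.0%, code: 8.3%
```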