Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

April 5, 2024
Authors: Xinrun Du, Zhouliang Yu, Songyang Gao, Ding Pan, Yuyang Cheng, Ziyang Ma, Ruibin Yuan, Xingwei Qu, Jiaheng Liu, Tianyu Zheng, Xinchen Luo, Guorui Zhou, Binhang Yuan, Wenhu Chen, Jie Fu, Ge Zhang
cs.AI

Abstract

In this study, we introduce CT-LLM, a 2B large language model (LLM) that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on CHC-Bench, CT-LLM excels in Chinese language tasks and showcases its adeptness in English through SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full process of training a Chinese LLM, including a detailed data processing procedure, the resulting Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-parameter Chinese Tiny LLM (CT-LLM), we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile language models.
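As a rough illustration of the corpus composition stated above (800 billion Chinese, 300 billion English, and 100 billion code tokens out of 1,200 billion total), the Python sketch below derives per-category sampling weights and draws examples in those proportions. This is a minimal sketch under stated assumptions, not the authors' released pipeline; the category names and the sample_category helper are hypothetical.

import random

# Token counts (in billions) per data category, as stated in the abstract.
TOKENS_BILLIONS = {"chinese": 800, "english": 300, "code": 100}
total = sum(TOKENS_BILLIONS.values())  # 1,200B tokens in total

# Sampling weights proportional to each category's share of the corpus:
# chinese ~ 0.667, english ~ 0.25, code ~ 0.083.
weights = {name: count / total for name, count in TOKENS_BILLIONS.items()}

def sample_category(rng: random.Random) -> str:
    # Draw one category according to the corpus proportions (hypothetical helper).
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]

rng = random.Random(0)
draws = [sample_category(rng) for _ in range(10_000)]
for name in weights:
    print(name, round(draws.count(name) / len(draws), 3))

Running the sketch prints empirical frequencies close to 0.667, 0.25, and 0.083, matching the two-to-one emphasis on Chinese over English and code that the abstract describes.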
