Baichuan 2: Open Large-scale Language Models
September 19, 2023
Authors: Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang, Yiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, Zhiying Wu
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable performance on a
variety of natural language tasks based on just a few examples of natural
language instructions, reducing the need for extensive feature engineering.
However, most powerful LLMs are closed-source or limited in their capability
for languages other than English. In this technical report, we present Baichuan
2, a series of large-scale multilingual language models containing 7 billion
and 13 billion parameters, trained from scratch on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on
public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan
2 excels in vertical domains such as medicine and law. We will release all
pre-training model checkpoints to benefit the research community in better
understanding the training dynamics of Baichuan 2.