Baichuan 2: Open Large-scale Language Models
September 19, 2023
Authors: Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang, Yiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, Zhiying Wu
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks given just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks such as MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to help the research community better understand the training dynamics of Baichuan 2.
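Since the abstract promises the release of pre-training checkpoints, the sketch below shows one way such a checkpoint might be loaded for inference with the Hugging Face transformers library. The repository name "baichuan-inc/Baichuan2-7B-Base" and the use of trust_remote_code are assumptions about how the weights are published, not details taken from the report.

# Minimal inference sketch; assumes the checkpoint is hosted on the
# Hugging Face Hub under the (assumed) repo "baichuan-inc/Baichuan2-7B-Base".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "baichuan-inc/Baichuan2-7B-Base"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit 7B weights in GPU memory
    device_map="auto",          # place layers on available devices automatically
    trust_remote_code=True,     # custom architectures ship their own modeling code
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Because these are base (pre-training) checkpoints rather than chat models, plain text continuation as above is the natural usage; instruction-style prompting would apply only to aligned variants.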