OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch
September 19, 2023
Authors: Juntao Li, Zecheng Tang, Yuyang Ding, Pinzheng Wang, Pei Guo, Wangjie You, Dan Qiao, Wenliang Chen, Guohong Fu, Qiaoming Zhu, Guodong Zhou, Min Zhang
cs.AI
Abstract
Large language models (LLMs) with billions of parameters have demonstrated
outstanding performance on various natural language processing tasks. This
report presents OpenBA, an open-sourced 15B bilingual asymmetric seq2seq model,
to contribute an LLM variant to the Chinese-oriented open-source model
community. We enhance OpenBA with effective and efficient techniques and adopt
a three-stage training strategy to train the model from scratch. Trained on
only 380B tokens, our model achieves very competitive performance,
outperforming LLaMA-70B on the BELEBELE benchmark, BLOOM-176B on the MMLU
benchmark, and GLM-130B on the C-Eval (hard) benchmark. This report provides
the main details to pre-train an analogous model, including pre-training data
processing, Bilingual Flan data collection, the empirical observations that
inspire our model architecture design, training objectives of different stages,
and other enhancement techniques. We have refactored our code to follow the
design principles of the Huggingface Transformers Library, making it more
convenient for developers to use, and released checkpoints of different
training stages at https://huggingface.co/openBA. More details of our project
are available at https://github.com/OpenNLG/openBA.git.