Ziya2: Data-centric Learning is All LLMs Need
November 6, 2023
Authors: Ruyi Gan, Ziwei Wu, Renliang Sun, Junyu Lu, Xiaojun Wu, Dixiang Zhang, Kunhao Pan, Ping Yang, Qi Yang, Jiaxing Zhang, Yan Song
cs.AI
Abstract
Various large language models (LLMs), both closed- and open-source, have been proposed in recent years, continually setting new records on multiple benchmarks. However, the development of LLMs still faces several issues, such as the high cost of training models from scratch and the catastrophic forgetting caused by continual pre-training. Although many of these issues have been addressed in LLM research, an important yet practical limitation remains: many studies pursue ever-larger model sizes without comprehensively analyzing and optimizing how pre-training data is used in the learning process, or how such data should be organized and leveraged to train LLMs in cost-effective settings. In this work, we propose Ziya2, a 13-billion-parameter model that adopts LLaMA2 as its foundation and is further pre-trained on 700 billion tokens; we focus on pre-training techniques and use data-centric optimization to enhance Ziya2's learning process at different stages. Experiments show that Ziya2 significantly outperforms other models on multiple benchmarks, with especially promising results compared to representative open-source models. Ziya2 (Base) is released at https://huggingface.co/IDEA-CCNL/Ziya2-13B-Base and https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary.
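Since Ziya2-13B-Base is published on Hugging Face, it can presumably be loaded through the standard transformers API for LLaMA-style models. Below is a minimal sketch, assuming the released checkpoint follows the standard transformers format; the prompt text and generation settings are illustrative and not taken from the paper.

```python
# Minimal sketch: loading the released Ziya2-13B-Base checkpoint.
# Assumes the repo follows the standard transformers LLaMA format;
# prompt and generation settings below are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "IDEA-CCNL/Ziya2-13B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit a 13B model on a large GPU
    device_map="auto",          # let accelerate place layers across available devices
)

prompt = "The three primary colors are"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that this is a base (not chat-aligned) model, so it is best suited to plain-text continuation prompts like the one above.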