Ziya2: Data-centric Learning is All LLMs Need

Abstract

Various large language models (LLMs), both closed- and open-source, have been proposed in recent years and continue to set new records on multiple benchmarks. However, the development of LLMs still faces several issues, such as the high cost of training models from scratch and the catastrophic forgetting caused by continual pre-training. Although many of these issues have been addressed in LLM research, an important practical limitation remains: many studies pursue ever larger model sizes without comprehensively analyzing and optimizing the use of pre-training data in the learning process, or appropriately organizing and leveraging such data when training LLMs under cost-effective settings. In this work, we propose Ziya2, a 13-billion-parameter model that adopts LLaMA2 as its foundation model and is further pre-trained on 700 billion tokens. We focus on pre-training techniques and use data-centric optimization to enhance the learning process of Ziya2 at different stages. Experiments show that Ziya2 significantly outperforms other models on multiple benchmarks, with especially promising results compared to representative open-source models. Ziya2 (Base) is released at https://huggingface.co/IDEA-CCNL/Ziya2-13B-Base and https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary.
