Ziya2: Data-centric Learning is All LLMs Need

Abstract

Various large language models (LLMs), both closed- and open-source, have been proposed in recent years, continually setting new records on multiple benchmarks. However, the development of LLMs still faces several issues, such as the high cost of training models from scratch and the catastrophic forgetting caused by continual pre-training. Although many such issues are being addressed in ongoing LLM research, an important practical limitation remains: many studies pursue larger model sizes without comprehensively analyzing and optimizing how pre-training data is used in the learning process, or how such data should be organized and leveraged when training LLMs under cost-effective settings. In this work, we propose Ziya2, a model with 13 billion parameters that adopts LLaMA2 as the foundation model and is further pre-trained on 700 billion tokens. We focus on pre-training techniques and use data-centric optimization to enhance the learning process of Ziya2 at different stages. Experiments show that Ziya2 significantly outperforms other models on multiple benchmarks, with especially promising results compared to representative open-source ones. Ziya2 (Base) is released at https://huggingface.co/IDEA-CCNL/Ziya2-13B-Base and https://modelscope.cn/models/Fengshenbang/Ziya2-13B-Base/summary.
