OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch

Abstract

Large language models (LLMs) with billions of parameters have demonstrated outstanding performance on various natural language processing tasks. This report presents OpenBA, an open-sourced 15B bilingual asymmetric seq2seq model, to contribute an LLM variant to the Chinese-oriented open-source model community. We enhance OpenBA with effective and efficient techniques and adopt a three-stage training strategy to train the model from scratch. With only 380B tokens, our solution achieves very competitive performance, outperforming LLaMA-70B on the BELEBELE benchmark, BLOOM-176B on the MMLU benchmark, and GLM-130B on the C-Eval (hard) benchmark. This report provides the main details needed to pre-train an analogous model, including pre-training data processing, Bilingual Flan data collection, the empirical observations that inspired our model architecture design, the training objectives of the different stages, and other enhancement techniques. We have refactored our code to follow the design principles of the Huggingface Transformers library, making it more convenient for developers to use, and released checkpoints from the different training stages at https://huggingface.co/openBA. More details of our project are available at https://github.com/OpenNLG/openBA.git.
