PolyLM: An Open Source Polyglot Large Language Model

Abstract

Large language models (LLMs) demonstrate a remarkable ability to comprehend, reason, and generate text following natural language instructions. However, the development of LLMs has focused primarily on high-resource languages such as English, which limits their applicability and research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training. Further, we propose a multilingual self-instruct method that automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, including multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the instruction data and multilingual benchmark, are available at: https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation.
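
The curriculum strategy can be illustrated with a minimal sketch of stage-dependent data mixing. Only the 30% and 60% non-English proportions come from the abstract; the function name `mix_batch`, the two-stage dictionary, and the `(language, text)` document format are hypothetical choices made for illustration and are not the authors' implementation.

```python
import random

# Non-English share per pre-training stage. The two values are taken from the
# abstract; representing the schedule as a two-entry dict is an assumption.
NON_ENGLISH_RATIO = {"first": 0.30, "final": 0.60}


def mix_batch(english_docs, non_english_docs, stage, batch_size, seed=0):
    """Build one batch whose non-English share matches the given stage.

    Documents are (language, text) tuples; this format is hypothetical.
    """
    rng = random.Random(seed)
    n_non_english = round(batch_size * NON_ENGLISH_RATIO[stage])
    # Sample with replacement from each pool, then shuffle the combined batch.
    batch = rng.choices(non_english_docs, k=n_non_english)
    batch += rng.choices(english_docs, k=batch_size - n_non_english)
    rng.shuffle(batch)
    return batch


english = [("en", f"english document {i}") for i in range(5)]
other = [("zh", "中文文档"), ("es", "documento en español"), ("ko", "한국어 문서")]

first_stage = mix_batch(english, other, "first", batch_size=10)
final_stage = mix_batch(english, other, "final", batch_size=10)
```

In this sketch the mixing ratio is fixed per stage and enforced exactly per batch; a real pre-training pipeline would more likely apply the proportion at the corpus or token level.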
