Orion-14B: Open-source Multilingual Large Language Models

Abstract

In this study, we introduce Orion-14B, a collection of multilingual large language models with 14 billion parameters. We use a data scheduling approach to train a foundation model on a diverse corpus of 2.5 trillion tokens sourced from texts in English, Chinese, Japanese, Korean, and other languages. We also fine-tune a series of models tailored for conversational applications and other specific use cases. Our evaluation results show that Orion-14B achieves state-of-the-art performance across a broad spectrum of tasks. We make the Orion-14B model family and its associated code publicly available at https://github.com/OrionStarAI/Orion, aiming to inspire future research and practical applications in the field.
