Orion-14B: Open-source Multilingual Large Language Models

Abstract

In this study, we introduce Orion-14B, a collection of multilingual large language models with 14 billion parameters. We use a data scheduling approach to train a foundation model on a diverse corpus of 2.5 trillion tokens drawn from texts in English, Chinese, Japanese, Korean, and other languages. We also fine-tune a series of models tailored to conversational applications and other specific use cases. Our evaluation results show that Orion-14B achieves state-of-the-art performance across a broad spectrum of tasks. We make the Orion-14B model family and its associated code publicly available at https://github.com/OrionStarAI/Orion, aiming to inspire future research and practical applications in the field.
