Orion-14B: Open-source Multilingual Large Language Models

Abstract

In this study, we introduce Orion-14B, a collection of multilingual large language models with 14 billion parameters. We use a data scheduling approach to train a foundation model on a diverse corpus of 2.5 trillion tokens, sourced from texts in English, Chinese, Japanese, Korean, and other languages. In addition, we fine-tune a series of models tailored for conversational applications and other specific use cases. Our evaluation results show that Orion-14B achieves state-of-the-art performance across a broad range of tasks. We make the Orion-14B model family and its associated code publicly available at https://github.com/OrionStarAI/Orion, aiming to inspire future research and practical applications in the field.
