Orion-14B: Open-source Multilingual Large Language Models

January 20, 2024
Authors: Du Chen, Yi Huang, Xiaopu Li, Yongqiang Li, Yongqiang Liu, Haihui Pan, Leichao Xu, Dacheng Zhang, Zhipeng Zhang, Kun Han
cs.AI

Abstract

In this study, we introduce Orion-14B, a collection of multilingual large language models with 14 billion parameters. We use a data scheduling approach to train a foundational model on a diverse corpus of 2.5 trillion tokens, sourced from texts in English, Chinese, Japanese, Korean, and other languages. Additionally, we fine-tune a series of models tailored for conversational applications and other specific use cases. Our evaluation results demonstrate that Orion-14B achieves state-of-the-art performance across a broad spectrum of tasks. We make the Orion-14B model family and its associated code publicly accessible at https://github.com/OrionStarAI/Orion, aiming to inspire future research and practical applications in the field.