A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models
June 4, 2023
Authors: Hyunwoong Ko, Kichang Yang, Minho Ryu, Taekyoon Choi, Seungmu Yang, Jiwung Hyun, Sungho Park
cs.AI
Abstract
Polyglot is a pioneering project aimed at enhancing the non-English language
performance of multilingual language models. Despite the availability of
various multilingual models such as mBERT (Devlin et al., 2019), XGLM (Lin et
al., 2022), and BLOOM (Scao et al., 2022), researchers and developers often
resort to building monolingual models in their respective languages out of
dissatisfaction with current multilingual models' non-English language
capabilities. To address this gap, we seek to develop advanced multilingual
language models that offer improved performance in non-English languages. In
this paper, we introduce the Polyglot Korean models, which represent a specific
focus rather than being multilingual in nature. In collaboration with TUNiB,
our team collected 1.2TB of Korean data meticulously curated for our research
journey. We made a deliberate decision to prioritize the development of Korean
models before venturing into multilingual models. This choice was motivated by
multiple factors: first, the Korean models facilitated performance
comparisons with existing multilingual models; second, they catered to the
specific needs of Korean companies and researchers. This paper presents our
work in developing the Polyglot Korean models, which takes some steps towards
addressing the non-English language performance gap in multilingual language
models.