
A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models

June 4, 2023
Authors: Hyunwoong Ko, Kichang Yang, Minho Ryu, Taekyoon Choi, Seungmu Yang, Jiwung Hyun, Sungho Park
cs.AI

Abstract

Polyglot is a pioneering project aimed at enhancing the non-English language performance of multilingual language models. Despite the availability of various multilingual models such as mBERT (Devlin et al., 2019), XGLM (Lin et al., 2022), and BLOOM (Scao et al., 2022), researchers and developers often resort to building monolingual models in their respective languages due to dissatisfaction with current multilingual models' non-English language capabilities. To address this gap, we seek to develop advanced multilingual language models that offer improved performance in non-English languages. In this paper, we introduce the Polyglot Korean models, which represent a specific focus rather than being multilingual in nature. In collaboration with TUNiB, our team collected 1.2TB of Korean data meticulously curated for our research journey. We made a deliberate decision to prioritize the development of Korean models before venturing into multilingual models. This choice was motivated by multiple factors: first, the Korean models facilitate performance comparisons with existing multilingual models; and second, they cater to the specific needs of Korean companies and researchers. This paper presents our work in developing the Polyglot Korean models and proposes some steps toward addressing the non-English language performance gap in multilingual language models.