
Poro 34B and the Blessing of Multilinguality

April 2, 2024
Authors: Risto Luukkonen, Jonathan Burdge, Elaine Zosa, Aarne Talman, Ville Komulainen, Väinö Hatanpää, Peter Sarlin, Sampo Pyysalo
cs.AI

Abstract

The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing and that it should be possible to substantially improve over the capabilities of monolingual models for small languages through multilingual training. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that not only substantially advances over the capabilities of existing models for Finnish, but also excels in translation and is competitive in its class in generating English and programming languages. We release the model parameters, scripts, and data under open licenses at https://huggingface.co/LumiOpen/Poro-34B.
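For reference, the released checkpoint can be loaded with the standard Hugging Face `transformers` API. The sketch below is a minimal, illustrative example only: it assumes a recent `transformers` and `accelerate` installation and uses the model ID implied by the release URL above; the dtype, device placement, and prompt are assumptions, not settings prescribed by the authors.

```python
# Minimal sketch of loading the released Poro 34B checkpoint.
# Assumptions: standard Hugging Face transformers usage; `accelerate` installed
# for device_map="auto"; enough GPU memory for a 34B-parameter model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LumiOpen/Poro-34B"  # from the release URL in the abstract

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # reduced precision keeps the 34B weights manageable
    device_map="auto",           # spread layers across available devices
)

# Example Finnish prompt, reflecting the model's primary target language.
prompt = "Suomen pääkaupunki on"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```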
