Poro 34B and the Blessing of Multilinguality

April 2, 2024
Authors: Risto Luukkonen, Jonathan Burdge, Elaine Zosa, Aarne Talman, Ville Komulainen, Väinö Hatanpää, Peter Sarlin, Sampo Pyysalo
cs.AI

Abstract

The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing and that it should be possible to substantially improve over the capabilities of monolingual models for small languages through multilingual training. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that not only substantially advances over the capabilities of existing models for Finnish, but also excels in translation and is competitive in its class in generating English and programming languages. We release the model parameters, scripts, and data under open licenses at https://huggingface.co/LumiOpen/Poro-34B.
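Since the model parameters are released openly at the Hugging Face repository above, the checkpoint can presumably be loaded with the standard Hugging Face transformers API. The following is a minimal sketch under that assumption, using the generic AutoTokenizer/AutoModelForCausalLM workflow and a made-up Finnish prompt; it is an illustration, not loading code taken from the paper or the repository.

# Minimal sketch (assumption: standard transformers causal-LM usage applies
# to the LumiOpen/Poro-34B checkpoint; prompt and settings are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "LumiOpen/Poro-34B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # assumption: half precision to fit a 34B model
    device_map="auto",           # spread layers across available GPUs
)

# A Finnish prompt, reflecting the Finnish/English/code pretraining mix.
prompt = "Suomen pääkaupunki on"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

A 34-billion-parameter model in bfloat16 needs roughly 70 GB of accelerator memory for inference, so multi-GPU sharding (as device_map="auto" attempts) or quantization would typically be required in practice.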
