Poro 34B and the Blessing of Multilinguality
April 2, 2024
Authors: Risto Luukkonen, Jonathan Burdge, Elaine Zosa, Aarne Talman, Ville Komulainen, Väinö Hatanpää, Peter Sarlin, Sampo Pyysalo
cs.AI
Abstract
The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing and that it should be possible to substantially improve over the capabilities of monolingual models for small languages through multilingual training. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that not only substantially advances over the capabilities of existing models for Finnish, but also excels in translation and is competitive in its class in generating English and programming languages. We release the model parameters, scripts, and data under open licenses at https://huggingface.co/LumiOpen/Poro-34B.
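Since the checkpoint is published on the Hugging Face Hub, a minimal sketch of loading it with the Transformers library might look as follows. The model id LumiOpen/Poro-34B is taken from the release URL above; the dtype, device placement, and the example Finnish prompt are illustrative assumptions, not the authors' evaluation setup.

```python
# Minimal sketch: load the released Poro 34B checkpoint and generate a short
# continuation. Generation settings below are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "LumiOpen/Poro-34B"  # from the release URL in the abstract
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # assumption: bf16 to reduce memory for a 34B model
    device_map="auto",           # assumption: shard across available GPUs
)

prompt = "Suomen pääkaupunki on"  # "The capital of Finland is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```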