FinGPT: Large Generative Models for a Small Language
November 3, 2023
作者: Risto Luukkonen, Ville Komulainen, Jouni Luoma, Anni Eskelinen, Jenna Kanerva, Hanna-Mari Kupari, Filip Ginter, Veronika Laippala, Niklas Muennighoff, Aleksandra Piktus, Thomas Wang, Nouamane Tazi, Teven Le Scao, Thomas Wolf, Osma Suominen, Samuli Sairanen, Mikko Merioksa, Jyrki Heinonen, Aija Vahtola, Samuel Antao, Sampo Pyysalo
cs.AI
Abstract
Large language models (LLMs) excel in many tasks in NLP and beyond, but most
open models have very limited coverage of smaller languages and LLM work tends
to focus on languages where nearly unlimited data is available for pretraining.
In this work, we study the challenges of creating LLMs for Finnish, a language
spoken by less than 0.1% of the world population. We compile an extensive
dataset of Finnish combining web crawls, news, social media and eBooks. We
pursue two approaches to pretrain models: 1) we train seven monolingual models
from scratch (186M to 13B parameters) dubbed FinGPT; 2) we continue the
pretraining of the multilingual BLOOM model on a mix of its original training
data and Finnish, resulting in a 176 billion parameter model we call BLUUMI.
For model evaluation, we introduce FIN-bench, a version of BIG-bench with
Finnish tasks. We also assess other model qualities such as toxicity and bias.
Our models and tools are openly available at https://turkunlp.org/gpt3-finnish.
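
A minimal sketch of how one of the released checkpoints might be loaded for text generation with the Hugging Face transformers library. The model identifier used here is an assumption for illustration only; the project page (https://turkunlp.org/gpt3-finnish) lists the actual model names and sizes.

```python
# Minimal sketch: loading a FinGPT checkpoint with Hugging Face transformers.
# The model identifier below is assumed for illustration; consult
# https://turkunlp.org/gpt3-finnish for the released model names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TurkuNLP/gpt3-finnish-small"  # assumed identifier; substitute the real one

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate a short Finnish continuation from a prompt.
prompt = "Suomi on"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```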