InkubaLM: A small language model for low-resource African languages

Abstract

High-resource language models often fall short in the African context, where there is a critical need for models that are efficient, accessible, and locally relevant, even amid significant computing and data constraints. This paper introduces InkubaLM, a small language model with 0.4 billion parameters, which achieves performance comparable to models with significantly larger parameter counts and more extensive training data on tasks such as machine translation, question answering, AfriMMLU, and AfriXnli. Notably, InkubaLM outperforms many larger models in sentiment analysis and demonstrates remarkable consistency across multiple languages. This work represents a pivotal advancement in challenging the conventional paradigm that effective language models must rely on substantial resources. Our model and datasets are publicly available at https://huggingface.co/lelapa to encourage research and development on low-resource languages.
