InkubaLM: A small language model for low-resource African languages

August 30, 2024
作者: Atnafu Lambebo Tonja, Bonaventure F. P. Dossou, Jessica Ojo, Jenalea Rajab, Fadel Thior, Eric Peter Wairagala, Aremu Anuoluwapo, Pelonomi Moiloa, Jade Abbott, Vukosi Marivate, Benjamin Rosman
cs.AI

Abstract

High-resource language models often fall short in the African context, where there is a critical need for models that are efficient, accessible, and locally relevant, even amidst significant computing and data constraints. This paper introduces InkubaLM, a small language model with 0.4 billion parameters, which achieves performance comparable to models with significantly larger parameter counts and more extensive training data on tasks such as machine translation, question-answering, AfriMMLU, and the AfriXnli task. Notably, InkubaLM outperforms many larger models in sentiment analysis and demonstrates remarkable consistency across multiple languages. This work represents a pivotal advancement in challenging the conventional paradigm that effective language models must rely on substantial resources. Our model and datasets are publicly available \url{https://huggingface.co/lelapa} to encourage research and development on low-resource languages.