The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP
October 7, 2025
Authors: Sheriff Issaka, Keyi Wang, Yinka Ajibola, Oluwatumininu Samuel-Ipaye, Zhaoyi Zhang, Nicte Aguillon Jimenez, Evans Kofi Agyei, Abraham Lin, Rohan Ramachandran, Sadick Abdul Mumin, Faith Nchifor, Mohammed Shuraim, Lieqi Liu, Erick Rosas Gonzalez, Sylvester Kpei, Jemimah Osei, Carlene Ajeneza, Persis Boateng, Prisca Adwoa Dufie Yeboah, Saadia Gabriel
cs.AI
Abstract
Despite representing nearly one-third of the world's languages, African
languages remain critically underserved by modern NLP technologies, with 88%
classified as severely underrepresented or completely ignored in computational
linguistics. We present the African Languages Lab (All Lab), a comprehensive
research initiative that addresses this technological gap through systematic
data collection, model development, and capacity building. Our contributions
include: (1) a quality-controlled data collection pipeline, yielding the
largest validated African multi-modal speech and text dataset spanning 40
languages with 19 billion tokens of monolingual text and 12,628 hours of
aligned speech data; (2) extensive experimental validation demonstrating that
our dataset, combined with fine-tuning, achieves substantial improvements over
baseline models, averaging +23.69 ChrF++, +0.33 COMET, and +15.34 BLEU points
across 31 evaluated languages; and (3) a structured research program that has
successfully mentored fifteen early-career researchers, establishing
sustainable local capacity. Our comparative evaluation against Google Translate
reveals competitive performance in several languages while identifying areas
that require continued development.