The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP

October 7, 2025
Authors: Sheriff Issaka, Keyi Wang, Yinka Ajibola, Oluwatumininu Samuel-Ipaye, Zhaoyi Zhang, Nicte Aguillon Jimenez, Evans Kofi Agyei, Abraham Lin, Rohan Ramachandran, Sadick Abdul Mumin, Faith Nchifor, Mohammed Shuraim, Lieqi Liu, Erick Rosas Gonzalez, Sylvester Kpei, Jemimah Osei, Carlene Ajeneza, Persis Boateng, Prisca Adwoa Dufie Yeboah, Saadia Gabriel
cs.AI

Abstract

Despite representing nearly one-third of the world's languages, African languages remain critically underserved by modern NLP technologies, with 88% classified as severely underrepresented or completely ignored in computational linguistics. We present the African Languages Lab (All Lab), a comprehensive research initiative that addresses this technological gap through systematic data collection, model development, and capacity building. Our contributions include: (1) a quality-controlled data collection pipeline, yielding the largest validated African multi-modal speech and text dataset spanning 40 languages with 19 billion tokens of monolingual text and 12,628 hours of aligned speech data; (2) extensive experimental validation demonstrating that our dataset, combined with fine-tuning, achieves substantial improvements over baseline models, averaging +23.69 ChrF++, +0.33 COMET, and +15.34 BLEU points across 31 evaluated languages; and (3) a structured research program that has successfully mentored fifteen early-career researchers, establishing sustainable local capacity. Our comparative evaluation against Google Translate reveals competitive performance in several languages while identifying areas that require continued development.