The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP

Abstract

Despite representing nearly one-third of the world's languages, African languages remain critically underserved by modern NLP technologies, with 88% classified as severely underrepresented or completely ignored in computational linguistics. We present the African Languages Lab (All Lab), a comprehensive research initiative that addresses this technological gap through systematic data collection, model development, and capacity building. Our contributions include: (1) a quality-controlled data collection pipeline that yields the largest validated African multi-modal speech and text dataset, spanning 40 languages with 19 billion tokens of monolingual text and 12,628 hours of aligned speech data; (2) extensive experimental validation showing that fine-tuning on our dataset achieves substantial improvements over baseline models, averaging +23.69 ChrF++, +0.33 COMET, and +15.34 BLEU points across 31 evaluated languages; and (3) a structured research program that has mentored fifteen early-career researchers, establishing sustainable local capacity. Our comparative evaluation against Google Translate reveals competitive performance in several languages while identifying areas that require continued development.
