NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities
May 23, 2025
Authors: Abdellah El Mekki, Houdaifa Atou, Omer Nacar, Shady Shehata, Muhammad Abdul-Mageed
cs.AI
Abstract
Enhancing the linguistic capabilities of Large Language Models (LLMs) to
include low-resource languages is a critical research area. Current research
directions predominantly rely on synthetic data generated by translating
English corpora, which, while demonstrating promising linguistic understanding
and translation abilities, often results in models aligned with source language
culture. These models frequently fail to represent the cultural heritage and
values of local communities. This work proposes a methodology to create both
synthetic and retrieval-based pre-training data tailored to a specific
community, considering its (i) language, (ii) cultural heritage, and (iii)
cultural values. We demonstrate our methodology using Egyptian and Moroccan
dialects as testbeds, chosen for their linguistic and cultural richness and
current underrepresentation in LLMs. As a proof-of-concept, we develop
NileChat, a 3B-parameter LLM adapted for Egyptian and Moroccan communities,
incorporating their language, cultural heritage, and values. Our results on
various understanding, translation, and cultural and values alignment
benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar
size and performs on par with larger models. We share our methods, data, and
models with the community to promote the inclusion and coverage of more diverse
communities in LLM development.