NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities
May 23, 2025
Authors: Abdellah El Mekki, Houdaifa Atou, Omer Nacar, Shady Shehata, Muhammad Abdul-Mageed
cs.AI
Abstract
Enhancing the linguistic capabilities of Large Language Models (LLMs) to
include low-resource languages is a critical research area. Current research
directions predominantly rely on synthetic data generated by translating
English corpora, which, while demonstrating promising linguistic understanding
and translation abilities, often results in models aligned with source language
culture. These models frequently fail to represent the cultural heritage and
values of local communities. This work proposes a methodology to create both
synthetic and retrieval-based pre-training data tailored to a specific
community, considering its (i) language, (ii) cultural heritage, and (iii)
cultural values. We demonstrate our methodology using Egyptian and Moroccan
dialects as testbeds, chosen for their linguistic and cultural richness and
current underrepresentation in LLMs. As a proof-of-concept, we develop
NileChat, a 3B-parameter LLM adapted for Egyptian and Moroccan communities,
incorporating their language, cultural heritage, and values. Our results on
various understanding, translation, and cultural and values alignment
benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar
size and performs on par with larger models. We share our methods, data, and
models with the community to promote the inclusion and coverage of more diverse
communities in LLM development.