NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities

Abstract

Enhancing the linguistic capabilities of Large Language Models (LLMs) to cover low-resource languages is a critical research area. Current research predominantly relies on synthetic data generated by translating English corpora. Models trained this way show promising linguistic understanding and translation abilities, but they often remain aligned with the culture of the source language and frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology for creating both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate the methodology using the Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and their current underrepresentation in LLMs. As a proof of concept, we develop NileChat, a 3B-parameter LLM adapted for Egyptian and Moroccan communities that incorporates their language, cultural heritage, and values. Our results on a range of understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in LLM development.

