NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities

Abstract

Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research predominantly relies on synthetic data generated by translating English corpora. Models trained this way demonstrate promising linguistic understanding and translation abilities, but they are often aligned with the culture of the source language and frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using the Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and their current underrepresentation in LLMs. As a proof of concept, we develop NileChat, a 3B-parameter LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in LLM development.
