NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities

Abstract

Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research predominantly relies on synthetic data generated by translating English corpora. Although this approach yields promising language understanding and translation abilities, the resulting models are often aligned with the culture of the source language and frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology for creating both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate the methodology using the Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and their current underrepresentation in LLMs. As a proof of concept, we develop NileChat, a 3B-parameter LLM adapted for Egyptian and Moroccan communities that incorporates their language, cultural heritage, and values. Our results on a range of understanding, translation, cultural-alignment, and value-alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in LLM development.
