NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities

Abstract

Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research predominantly relies on synthetic data generated by translating English corpora. This approach yields promising linguistic understanding and translation abilities, but it often produces models aligned with the culture of the source language. Such models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology for creating both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate the methodology using the Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and their current underrepresentation in LLMs. As a proof of concept, we develop NileChat, a 3B-parameter LLM adapted for Egyptian and Moroccan communities that incorporates their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in LLM development.
