DIWALI - Diversity and Inclusivity aWare cuLture specific Items for India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian Context

Abstract

Large language models (LLMs) are widely used across many tasks and applications. Despite their broad capabilities, they have been shown to lack cultural alignment (ryan-etal-2024-unintended, alkhamissi-etal-2024-investigating) and to produce biased generations (naous-etal-2024-beer) because of limited cultural knowledge and competence. Evaluating LLMs for cultural awareness and alignment is particularly challenging: proper evaluation metrics are lacking, and there are no culturally grounded datasets that represent the complexity of cultures at the regional and sub-regional levels. Existing datasets of culture specific items (CSIs) focus primarily on concepts at the regional level and may contain false positives. To address this issue, we introduce a novel CSI dataset for Indian culture, covering 17 cultural facets. The dataset comprises approximately 8k cultural concepts from 36 sub-regions. To measure the cultural competence of LLMs on a cultural text adaptation task, we evaluate the adaptations using the CSIs we created, LLM as Judge, and human evaluations from diverse socio-demographic regions. Furthermore, we perform a quantitative analysis demonstrating selective sub-regional coverage and surface-level adaptations across all considered LLMs. Our dataset is available at https://huggingface.co/datasets/nlip/DIWALI, the project webpage is at https://nlip-lab.github.io/nlip/publications/diwali/, and our codebase with model outputs can be found at https://github.com/pramitsahoo/culture-evaluation.