DIWALI - A Diversity- and Inclusivity-Aware Dataset of Culture-Specific Items for India and an Evaluation of LLMs on Cultural Text Adaptation

Abstract

Large language models (LLMs) are widely used across many tasks and applications. Despite their broad capabilities, they have been shown to lack cultural alignment (ryan-etal-2024-unintended, alkhamissi-etal-2024-investigating) and to produce biased generations (naous-etal-2024-beer) because they lack cultural knowledge and competence. Evaluating LLMs for cultural awareness and alignment is particularly challenging for two reasons: proper evaluation metrics are lacking, and there are few culturally grounded datasets that represent the complexity of cultures at the regional and sub-regional levels. Existing datasets of culture-specific items (CSIs) focus primarily on concepts at the regional level and may contain false positives. To address this issue, we introduce a novel CSI dataset for Indian culture spanning 17 cultural facets. The dataset comprises approximately 8k cultural concepts from 36 sub-regions. To measure the cultural competence of LLMs on a cultural text adaptation task, we evaluate the adaptations using the created CSIs, an LLM-as-a-judge setup, and human evaluations from diverse socio-demographic regions. Furthermore, our quantitative analysis shows selective sub-regional coverage and surface-level adaptations across all considered LLMs. Our dataset is available at https://huggingface.co/datasets/nlip/DIWALI, the project webpage is at https://nlip-lab.github.io/nlip/publications/diwali/, and our codebase with model outputs can be found at https://github.com/pramitsahoo/culture-evaluation.