From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation

Abstract

This paper addresses critical gaps in Arabic language model evaluation by establishing comprehensive theoretical guidelines and introducing a novel evaluation framework. We first analyze existing Arabic evaluation datasets, identifying significant issues in linguistic accuracy, cultural alignment, and methodological rigor. To address these limitations in LLMs, we present the Arabic Depth Mini Dataset (ADMD), a carefully curated collection of 490 challenging questions spanning ten major domains (42 sub-domains; see Figure 1). Using ADMD, we evaluate five leading language models: GPT-4, Claude 3.5 Sonnet, Gemini Flash 1.5, CommandR 100B, and Qwen-Max. Our results reveal significant variation in model performance across domains, with particular challenges in areas requiring deep cultural understanding and specialized knowledge. Claude 3.5 Sonnet achieved the highest overall accuracy at 30%, showing relative strength in the mathematical theory in Arabic, Arabic language, and Islamic domains. This work provides both theoretical foundations and practical insights for improving Arabic language model evaluation, emphasizing the importance of cultural competence alongside technical capabilities.
