From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation

Abstract

This paper addresses critical gaps in Arabic language model evaluation by establishing comprehensive theoretical guidelines and introducing a novel evaluation framework. We first analyze existing Arabic evaluation datasets, identifying significant issues in linguistic accuracy, cultural alignment, and methodological rigor. To address these limitations in LLMs, we present the Arabic Depth Mini Dataset (ADMD), a carefully curated collection of 490 challenging questions spanning ten major domains (42 sub-domains, see Figure 1). Using ADMD, we evaluate five leading language models: GPT-4, Claude 3.5 Sonnet, Gemini Flash 1.5, CommandR 100B, and Qwen-Max. Our results reveal significant variation in model performance across domains, with particular challenges in areas requiring deep cultural understanding and specialized knowledge. Claude 3.5 Sonnet achieved the highest overall accuracy at 30%, showing relative strength in mathematical theory in Arabic, the Arabic language, and Islamic domains. This work provides both theoretical foundations and practical insights for improving Arabic language model evaluation, emphasizing the importance of cultural competence alongside technical capabilities.
