From Guidelines to Practice: A New Paradigm for Arabic Language Model Evaluation

Abstract

This paper addresses critical gaps in Arabic language model evaluation by establishing comprehensive theoretical guidelines and introducing a novel evaluation framework. We first analyze existing Arabic evaluation datasets and identify significant issues in linguistic accuracy, cultural alignment, and methodological rigor. To address these limitations, we present the Arabic Depth Mini Dataset (ADMD), a carefully curated collection of 490 challenging questions spanning ten major domains (42 sub-domains, see Figure 1). Using ADMD, we evaluate five leading language models: GPT-4, Claude 3.5 Sonnet, Gemini Flash 1.5, CommandR 100B, and Qwen-Max. Our results reveal significant variation in model performance across domains, with particular difficulty in areas requiring deep cultural understanding and specialized knowledge. Claude 3.5 Sonnet achieved the highest overall accuracy at 30%, showing relative strength in mathematical theory in Arabic, Arabic language, and Islamic domains. This work provides both theoretical foundations and practical insights for improving Arabic language model evaluation, emphasizing the importance of cultural competence alongside technical capabilities.
