AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
November 18, 2025
Authors: Mohammad Zbib, Hasan Abed Al Kader Hammoud, Sina Mukalled, Nadine Rizk, Fatima Karnib, Issam Lakkis, Ammar Mohanna, Bernard Ghanem
cs.AI
Abstract
We present AraLingBench: a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
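
To make the setup concrete, below is a minimal sketch of how a multiple-choice benchmark of this kind is typically scored per category. It is not the released AraLingBench evaluation code: the item schema (question, options, answer letter, category), the file name aralingbench.jsonl, and the ask_model callable are assumptions for illustration only.

import json
from collections import defaultdict

# Assumed item format (one JSON object per line):
#   {"question": ..., "options": [...], "answer": "A", "category": "syntax"}
# This schema is illustrative, not the actual AraLingBench release format.

def format_prompt(item):
    """Render a multiple-choice item as a plain-text prompt."""
    letters = "ABCD"
    lines = [item["question"]]
    for letter, option in zip(letters, item["options"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def score(items, ask_model):
    """Return per-category accuracy given a callable that maps a prompt to the model's letter choice."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        prediction = ask_model(format_prompt(item)).strip().upper()[:1]
        total[item["category"]] += 1
        if prediction == item["answer"]:
            correct[item["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

if __name__ == "__main__":
    with open("aralingbench.jsonl", encoding="utf-8") as f:
        items = [json.loads(line) for line in f]
    # Plug in any model-calling function here; this stub always answers "A".
    print(score(items, lambda prompt: "A"))

Per-category accuracies of this kind are what allow surface-level skills such as spelling to be separated from the deeper grammatical and syntactic reasoning discussed above.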