
AraLingBench A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

November 18, 2025
Authors: Mohammad Zbib, Hasan Abed Al Kader Hammoud, Sina Mukalled, Nadine Rizk, Fatima Karnib, Issam Lakkis, Ammar Mohanna, Bernard Ghanem
cs.AI

Abstract

We present AraLingBench: a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
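
To make the evaluation setup concrete, the sketch below shows one way to compute per-category accuracy for a model on a set of multiple-choice questions. This is not the authors' released evaluation code; the question fields (`category`, `prompt`, `choices`, `answer`) and the `predict` callable are assumptions introduced purely for illustration.

```python
# Hypothetical per-category multiple-choice scoring sketch.
# Field names and the predict() interface are illustrative assumptions,
# not the AraLingBench repository's actual API.
from collections import defaultdict


def score_by_category(questions, predict):
    """Return accuracy per linguistic category for one model.

    questions: list of dicts with "category", "prompt", "choices", "answer".
    predict: callable mapping (prompt, choices) -> the chosen option.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q["category"]] += 1
        if predict(q["prompt"], q["choices"]) == q["answer"]:
            correct[q["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


if __name__ == "__main__":
    # Toy example with a trivial baseline that always picks the first option.
    sample = [
        {"category": "grammar", "prompt": "...", "choices": ["A", "B", "C", "D"], "answer": "A"},
        {"category": "syntax", "prompt": "...", "choices": ["A", "B", "C", "D"], "answer": "C"},
    ]
    baseline = lambda prompt, choices: choices[0]
    print(score_by_category(sample, baseline))
```

Separating scores by category in this way is what lets the benchmark distinguish surface-level skills (e.g., spelling) from the deeper grammatical and syntactic reasoning the abstract identifies as the main weakness of current models.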