AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

Abstract

We present AraLingBench, a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories (grammar, morphology, spelling, reading comprehension, and syntax) and consists of 150 expert-designed multiple-choice questions that directly assess structural language understanding. An evaluation of 35 Arabic and bilingual LLMs shows that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
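
As an illustration of what measuring each linguistic skill separately might look like in practice, the following minimal sketch computes per-category accuracy over multiple-choice predictions. It is not the authors' released evaluation code: the function name `per_category_accuracy`, the item fields `id`, `category`, and `answer`, and the toy data are all assumptions introduced here for illustration.

```python
from collections import defaultdict

# The five categories named in the abstract.
CATEGORIES = ("grammar", "morphology", "spelling", "reading comprehension", "syntax")


def per_category_accuracy(items, predictions):
    """Return {category: accuracy} for multiple-choice predictions.

    items: list of dicts with hypothetical keys "id", "category", "answer".
    predictions: dict mapping item id to the option label a model chose.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        category = item["category"]
        total[category] += 1
        # A missing prediction is counted as wrong.
        if predictions.get(item["id"]) == item["answer"]:
            correct[category] += 1
    # Report only categories that actually have questions.
    return {c: correct[c] / total[c] for c in CATEGORIES if total[c]}


# Toy data for illustration only; not AraLingBench questions or results.
items = [
    {"id": 1, "category": "grammar", "answer": "A"},
    {"id": 2, "category": "grammar", "answer": "C"},
    {"id": 3, "category": "syntax", "answer": "B"},
]
predictions = {1: "A", 2: "B", 3: "B"}
scores = per_category_accuracy(items, predictions)
print(scores)
```

Reporting one score per category, rather than a single aggregate, is what would let a weakness in, say, syntax remain visible even when overall accuracy is high.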
