AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

Abstract

We present AraLingBench, a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories (grammar, morphology, spelling, reading comprehension, and syntax) through 150 expert-designed multiple-choice questions that directly assess structural language understanding. An evaluation of 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
