UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat
August 24, 2025
Author: Omer Nacar
cs.AI
Abstract
Large language models (LLMs) trained primarily on English corpora often
struggle to capture the linguistic and cultural nuances of Arabic. To address
this gap, the Saudi Data and AI Authority (SDAIA) introduced the ALLaM family
of Arabic-focused models. The most capable of these available to the public,
ALLaM-34B, was subsequently adopted by HUMAIN, which developed and deployed
HUMAIN Chat, a closed conversational web service built on this model. This
paper presents an expanded and refined UI-level evaluation of ALLaM-34B.
Using a prompt pack spanning Modern Standard Arabic (MSA), five regional dialects,
code-switching, factual knowledge, arithmetic and temporal reasoning, creative
generation, and adversarial safety, we collected 115 outputs (23 prompts ×
5 runs) and scored each with three frontier LLM judges (GPT-5, Gemini 2.5 Pro,
Claude Sonnet-4). We compute category-level means with 95% confidence
intervals, analyze score distributions, and visualize dialect-wise metric heat
maps. The updated analysis reveals consistently high performance on generation
and code-switching tasks (both averaging 4.92/5), alongside strong results in
MSA handling (4.74/5), solid reasoning ability (4.64/5), and improved dialect
fidelity (4.21/5). Safety-related prompts show stable performance
(4.54/5). Taken together, these results position ALLaM-34B as a robust and
culturally grounded Arabic LLM, demonstrating both technical strength and
practical readiness for real-world deployment.
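
The category-level means with 95% confidence intervals mentioned above can be illustrated with a minimal sketch. The abstract does not say which interval method was used, so this assumes a normal-approximation interval; the function names (`mean_ci`, `summarize`) and the toy scores are hypothetical and are not taken from the paper's data.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_ci(scores, confidence=0.95):
    """Return (mean, lower, upper) for a list of 1-5 judge scores.

    Assumption: a normal-approximation interval, mean +/- z * s / sqrt(n).
    The paper may have used a different method (e.g. t-interval or bootstrap).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate spread")
    m = mean(scores)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(scores) / math.sqrt(n)
    return m, m - half_width, m + half_width


def summarize(records):
    """Group (category, score) pairs and compute a mean and CI per category."""
    by_category = {}
    for category, score in records:
        by_category.setdefault(category, []).append(score)
    return {cat: mean_ci(vals) for cat, vals in by_category.items()}


# Hypothetical toy ratings, NOT the paper's results.
toy_records = [
    ("dialect", 4), ("dialect", 5), ("dialect", 4), ("dialect", 3),
    ("safety", 5), ("safety", 4), ("safety", 5), ("safety", 5),
]

for category, (m, lower, upper) in summarize(toy_records).items():
    print(f"{category}: mean={m:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

In the setup described, each category's score list would pool the ratings from all three judges over all runs of that category's prompts. With so few prompts per category, a t-based or bootstrap interval would likely be wider than this approximation, and the upper bound here is not clipped to the 5-point scale.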