

UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat

August 24, 2025
Author: Omer Nacar
cs.AI

Abstract

Large language models (LLMs) trained primarily on English corpora often struggle to capture the linguistic and cultural nuances of Arabic. To address this gap, the Saudi Data and AI Authority (SDAIA) introduced the ALLaM family of Arabic-focused models. The most capable of these available to the public, ALLaM-34B, was subsequently adopted by HUMAIN, which developed and deployed HUMAIN Chat, a closed conversational web service built on this model. This paper presents an expanded and refined UI-level evaluation of ALLaM-34B. Using a prompt pack spanning Modern Standard Arabic (MSA), five regional dialects, code-switching, factual knowledge, arithmetic and temporal reasoning, creative generation, and adversarial safety, we collected 115 outputs (23 prompts × 5 runs) and scored each with three frontier LLM judges (GPT-5, Gemini 2.5 Pro, Claude Sonnet-4). We compute category-level means with 95% confidence intervals, analyze score distributions, and visualize dialect-wise metric heat maps. The updated analysis reveals consistently high performance on generation and code-switching tasks (both averaging 4.92/5), alongside strong results in MSA handling (4.74/5), solid reasoning ability (4.64/5), and improved dialect fidelity (4.21/5). Safety-related prompts show stable, reliable performance (4.54/5). Taken together, these results position ALLaM-34B as a robust and culturally grounded Arabic LLM, demonstrating both technical strength and practical readiness for real-world deployment.
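The abstract reports category-level means with 95% confidence intervals but does not say how the intervals were computed. As a minimal sketch, assuming a normal-approximation interval over the per-output judge scores in a category, the calculation could look like the following. The function name `mean_with_ci`, the category keys, and the score lists are hypothetical placeholders for illustration, not data or code from the paper.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_with_ci(scores, confidence=0.95):
    """Return (mean, lower, upper) for a list of numeric scores.

    Assumption: a normal-approximation interval, mean +/- z * s / sqrt(n).
    The paper may have used a different method (e.g. t-based or bootstrap).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate an interval")
    m = mean(scores)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(scores) / sqrt(n)
    return m, m - half_width, m + half_width


# Hypothetical placeholder scores on a 1-5 scale (NOT results from the paper).
scores_by_category = {
    "code_switching": [5, 5, 4, 5, 5, 5],
    "dialect": [4, 5, 3, 4, 5, 4],
}

summary = {cat: mean_with_ci(vals) for cat, vals in scores_by_category.items()}
for category, (m, lower, upper) in summary.items():
    print(f"{category}: mean={m:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

With only a handful of scores per category, a t-based or bootstrap interval would usually be wider and arguably more appropriate; the normal approximation is used here only because it needs nothing beyond the Python standard library.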
September 2, 2025