UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat

Abstract

Large language models (LLMs) trained primarily on English corpora often struggle to capture the linguistic and cultural nuances of Arabic. To address this gap, the Saudi Data and AI Authority (SDAIA) introduced the ALLaM family of Arabic-focused models. The most capable publicly available member of this family, ALLaM-34B, was subsequently adopted by HUMAIN, which developed and deployed HUMAIN Chat, a closed conversational web service built on this model. This paper presents an expanded and refined UI-level evaluation of ALLaM-34B. Using a prompt pack spanning Modern Standard Arabic (MSA), five regional dialects, code-switching, factual knowledge, arithmetic and temporal reasoning, creative generation, and adversarial safety, we collected 115 outputs (23 prompts × 5 runs) and scored each with three frontier LLM judges (GPT-5, Gemini 2.5 Pro, Claude Sonnet-4). We compute category-level means with 95% confidence intervals, analyze score distributions, and visualize dialect-wise metric heat maps. The updated analysis shows consistently high performance on generation and code-switching tasks (both averaging 4.92/5), alongside strong results in MSA handling (4.74/5), solid reasoning ability (4.64/5), and improved dialect fidelity (4.21/5). Safety-related prompts show stable, reliable performance (4.54/5). Taken together, these results position ALLaM-34B as a robust and culturally grounded Arabic LLM that combines technical strength with practical readiness for real-world deployment.
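As an illustration of the category-level statistics mentioned above, the following is a minimal sketch of how a per-category mean and 95% confidence interval could be computed from judge scores. The abstract does not state which interval method was used, so the normal approximation (z = 1.96) here is an assumption, as are the function name `category_means_with_ci`, the record layout, and the example scores, which are hypothetical and not taken from the paper.

```python
import math
import statistics
from collections import defaultdict

# Assumption: normal-approximation critical value for a 95% interval.
# The paper may have used a different method (e.g. Student's t or bootstrap).
Z_95 = 1.96


def category_means_with_ci(records):
    """Return {category: (mean, ci_low, ci_high)} for (category, score) pairs.

    Scores are assumed to lie on the 1-5 scale used by the LLM judges.
    """
    by_category = defaultdict(list)
    for category, score in records:
        by_category[category].append(float(score))

    summary = {}
    for category, scores in by_category.items():
        mean = statistics.fmean(scores)
        if len(scores) > 1:
            # Standard error of the mean from the sample standard deviation.
            half_width = Z_95 * statistics.stdev(scores) / math.sqrt(len(scores))
        else:
            # A single score gives no variance estimate.
            half_width = float("nan")
        summary[category] = (mean, mean - half_width, mean + half_width)
    return summary


# Hypothetical scores for illustration only; not data from the paper.
example_records = [
    ("dialect", 4), ("dialect", 5), ("dialect", 4), ("dialect", 4),
    ("safety", 5), ("safety", 5), ("safety", 4), ("safety", 5),
]
example_summary = category_means_with_ci(example_records)
```

Note that this sketch does not clip the interval to the 1-5 scale, so for categories with scores near the ceiling the upper bound can exceed 5; whether to clip is a reporting choice left open here.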