UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat

Abstract

Large language models (LLMs) trained primarily on English corpora often struggle to capture the linguistic and cultural nuances of Arabic. To address this gap, the Saudi Data and AI Authority (SDAIA) introduced the ALLaM family of Arabic-focused models. The most capable publicly available model in the family, ALLaM-34B, was subsequently adopted by HUMAIN, which developed and deployed HUMAIN Chat, a closed conversational web service built on this model. This paper presents an expanded and refined UI-level evaluation of ALLaM-34B. Using a prompt pack spanning Modern Standard Arabic (MSA), five regional dialects, code-switching, factual knowledge, arithmetic and temporal reasoning, creative generation, and adversarial safety, we collected 115 outputs (23 prompts × 5 runs) and scored each with three frontier LLM judges (GPT-5, Gemini 2.5 Pro, Claude Sonnet-4). We compute category-level means with 95% confidence intervals, analyze score distributions, and visualize dialect-wise metric heat maps. The updated analysis reveals consistently high performance on generation and code-switching tasks (both averaging 4.92/5), alongside strong results in MSA handling (4.74/5), solid reasoning ability (4.64/5), and improved dialect fidelity (4.21/5). Safety-related prompts show stable, reliable performance (4.54/5). Taken together, these results position ALLaM-34B as a robust and culturally grounded Arabic LLM, demonstrating both technical strength and practical readiness for real-world deployment.
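As an illustration of the category-level means with 95% confidence intervals mentioned above, the following is a minimal sketch only. The abstract does not state which interval method was used, so the normal-approximation interval here is an assumption, and the function name `mean_with_ci`, the category keys, and all scores are hypothetical placeholders rather than data from the evaluation.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_with_ci(scores, confidence=0.95):
    """Return (mean, lower, upper) for a list of numeric scores.

    Assumption: a normal-approximation interval, mean +/- z * s / sqrt(n).
    The paper's actual interval method is not given in the abstract.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate an interval")
    m = mean(scores)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(scores) / sqrt(n)
    return m, m - half_width, m + half_width


# Hypothetical judge scores on a 1-5 scale, NOT results from the paper.
scores_by_category = {
    "dialect": [4, 5, 4, 4, 5, 3, 4, 5],
    "safety": [5, 5, 4, 5, 4, 5, 5, 4],
}

summary = {cat: mean_with_ci(s) for cat, s in scores_by_category.items()}
for cat, (m, lower, upper) in summary.items():
    print(f"{cat}: {m:.2f} [{lower:.2f}, {upper:.2f}]")
```

With only a handful of scores per category, a t-based or bootstrap interval might be a more conservative choice than the normal approximation sketched here.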