GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models

Abstract

Large language models (LLMs) are advancing at an unprecedented pace globally, and regions are increasingly adopting these models for applications in their primary languages. Evaluating these models in diverse linguistic environments, especially in low-resource languages, has become a major challenge for both academia and industry. Existing evaluation frameworks focus disproportionately on English and a handful of high-resource languages, and therefore give a limited picture of how LLMs actually perform in multilingual and lower-resource scenarios. To address this gap, we introduce GlotEval, a lightweight framework designed for massively multilingual evaluation. GlotEval supports seven key tasks (machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, and intrinsic evaluation) across dozens to hundreds of languages, and it emphasizes consistent multilingual benchmarking, language-specific prompt templates, and non-English-centric machine translation. This enables a precise diagnosis of model strengths and weaknesses in diverse linguistic contexts. A multilingual translation case study demonstrates GlotEval's applicability to multilingual and language-specific evaluations.