IberBench: LLM Evaluation on Iberian Languages

Abstract

Large Language Models (LLMs) remain difficult to evaluate comprehensively, particularly for languages other than English, where high-quality data is often limited. Existing benchmarks and leaderboards are predominantly English-centric, with only a few addressing other languages. These benchmarks fall short in several key areas: they overlook the diversity of language varieties, prioritize fundamental Natural Language Processing (NLP) capabilities over tasks of industrial relevance, and are static. With these aspects in mind, we present IberBench, a comprehensive and extensible benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks, in languages spoken across the Iberian Peninsula and Ibero-America. IberBench integrates 101 datasets from evaluation campaigns and recent benchmarks, covering 22 task categories such as sentiment and emotion analysis, toxicity detection, and summarization. The benchmark addresses key limitations in current evaluation practices, such as the lack of linguistic diversity and static evaluation setups, by enabling continual updates and community-driven model and dataset submissions moderated by a committee of experts. We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations. Our findings indicate that (i) LLMs perform worse on industry-relevant tasks than on fundamental ones, (ii) performance is on average lower for Galician and Basque, (iii) some tasks show results close to random, and (iv) on other tasks, LLMs perform above random but below shared-task systems. IberBench offers open-source implementations for the entire evaluation pipeline, including dataset normalization and hosting, incremental evaluation of LLMs, and a publicly accessible leaderboard.
