IberBench: LLM Evaluation on Iberian Languages
April 23, 2025
Authors: José Ángel González, Ian Borrego Obrador, Álvaro Romo Herrero, Areg Mikael Sarvazyan, Mara Chinea-Ríos, Angelo Basile, Marc Franco-Salvador
cs.AI
Abstract
Large Language Models (LLMs) remain difficult to evaluate comprehensively,
particularly for languages other than English, where high-quality data is often
limited. Existing benchmarks and leaderboards are predominantly
English-centric, with only a few addressing other languages. These benchmarks
fall short in several key areas: they overlook the diversity of language
varieties, prioritize fundamental Natural Language Processing (NLP)
capabilities over tasks of industrial relevance, and are static. With these
aspects in mind, we present IberBench, a comprehensive and extensible benchmark
designed to assess LLM performance on both fundamental and industry-relevant
NLP tasks, in languages spoken across the Iberian Peninsula and Ibero-America.
IberBench integrates 101 datasets from evaluation campaigns and recent
benchmarks, covering 22 task categories such as sentiment and emotion analysis,
toxicity detection, and summarization. The benchmark addresses key limitations
in current evaluation practices, such as the lack of linguistic diversity and
static evaluation setups, by enabling continual updates and community-driven
model and dataset submissions moderated by a committee of experts. We evaluate
23 LLMs ranging from 100 million to 14 billion parameters and provide empirical
insights into their strengths and limitations. Our findings indicate that (i)
LLMs perform worse on industry-relevant tasks than on fundamental ones, (ii)
performance is on average lower for Galician and Basque, (iii) some tasks show
results close to random, and (iv) in other tasks LLMs perform above random but
below shared task systems. IberBench offers open-source implementations for the
entire evaluation pipeline, including dataset normalization and hosting,
incremental evaluation of LLMs, and a publicly accessible leaderboard.
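
To make the evaluation pipeline described above concrete, here is a minimal Python sketch of incremental evaluation over datasets in a normalized format. All names in it (`Dataset`, `Leaderboard`, `incremental_evaluation`, the exact-match metric, and the toy data) are hypothetical illustrations for this note, not the actual IberBench codebase or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# NOTE: all names below are hypothetical illustrations,
# not the actual IberBench implementation or API.

@dataclass
class Dataset:
    """A dataset normalized to a common (input, reference) format."""
    name: str
    language: str      # e.g. "Spanish", "Galician", "Basque"
    task: str          # e.g. "sentiment_analysis", "toxicity_detection"
    examples: List[Tuple[str, str]]

@dataclass
class Leaderboard:
    """Per-model, per-dataset scores backing a public results table."""
    scores: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def evaluated(self, model: str, dataset: str) -> bool:
        return dataset in self.scores.get(model, {})

    def update(self, model: str, dataset: str, score: float) -> None:
        self.scores.setdefault(model, {})[dataset] = score

def accuracy(predict: Callable[[str], str],
             examples: List[Tuple[str, str]]) -> float:
    """Exact-match accuracy of a model's predictions over normalized examples."""
    return sum(predict(x) == y for x, y in examples) / len(examples)

def incremental_evaluation(models: Dict[str, Callable[[str], str]],
                           datasets: List[Dataset],
                           board: Leaderboard) -> None:
    """Score only (model, dataset) pairs missing from the leaderboard, so a
    newly submitted model or dataset never forces re-running old results."""
    for name, predict in models.items():
        for ds in datasets:
            if board.evaluated(name, ds.name):
                continue  # already scored: skip to keep the run incremental
            board.update(name, ds.name, accuracy(predict, ds.examples))

if __name__ == "__main__":
    # Toy sentiment dataset in the normalized format; a trivial baseline model.
    ds = Dataset("toy_es_sentiment", "Spanish", "sentiment_analysis",
                 [("me encanta", "positive"), ("lo odio", "negative")])
    board = Leaderboard()
    incremental_evaluation({"always_positive": lambda _: "positive"}, [ds], board)
    print(board.scores)  # {'always_positive': {'toy_es_sentiment': 0.5}}
```

The key design point this sketch tries to capture is that skipping already-scored (model, dataset) pairs is what makes a community-driven benchmark extensible: a new model or dataset submission only triggers the missing evaluations rather than a full re-run of the leaderboard.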