IberBench: LLM Evaluation on Iberian Languages

Abstract

Large Language Models (LLMs) remain difficult to evaluate comprehensively, particularly for languages other than English, where high-quality data is often limited. Existing benchmarks and leaderboards are predominantly English-centric, with only a few addressing other languages. These benchmarks fall short in several key areas: they overlook the diversity of language varieties, prioritize fundamental Natural Language Processing (NLP) capabilities over tasks of industrial relevance, and are static. With these aspects in mind, we present IberBench, a comprehensive and extensible benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks in languages spoken across the Iberian Peninsula and Ibero-America. IberBench integrates 101 datasets from evaluation campaigns and recent benchmarks, covering 22 task categories such as sentiment and emotion analysis, toxicity detection, and summarization. The benchmark addresses key limitations in current evaluation practices, such as the lack of linguistic diversity and static evaluation setups, by enabling continual updates and community-driven model and dataset submissions moderated by a committee of experts. We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations. Our findings indicate that (i) LLMs perform worse on industry-relevant tasks than on fundamental ones, (ii) performance is on average lower for Galician and Basque, (iii) some tasks show results close to random, and (iv) in other tasks LLMs perform above random but below shared task systems. IberBench offers open-source implementations for the entire evaluation pipeline, including dataset normalization and hosting, incremental evaluation of LLMs, and a publicly accessible leaderboard.
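
Findings (iii) and (iv) place each LLM score relative to two reference points: a random baseline and a shared task system. The following is a minimal sketch of that comparison, not the IberBench implementation. The names `TaskResult` and `classify`, the `tolerance` parameter, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task: str                 # hypothetical task name
    llm_score: float          # score of the evaluated LLM (e.g., macro-F1)
    random_score: float       # score of a random baseline on the same data
    shared_task_score: float  # score of a reference shared task system


def classify(result: TaskResult, tolerance: float = 0.02) -> str:
    """Place an LLM score relative to the two reference points.

    `tolerance` is an assumed margin for treating a score as
    "close to random"; the abstract does not define one.
    """
    if result.llm_score <= result.random_score + tolerance:
        return "near random"
    if result.llm_score < result.shared_task_score:
        return "above random, below shared task systems"
    return "at or above shared task systems"


# Made-up numbers for illustration only; these are not results from the paper.
examples = [
    TaskResult("task_a", 0.51, 0.50, 0.80),
    TaskResult("task_b", 0.65, 0.50, 0.80),
]
labels = [classify(r) for r in examples]
```

Under these assumed numbers, `task_a` falls in the near-random band of finding (iii), while `task_b` matches the pattern of finding (iv).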
