FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models
December 15, 2025
Authors: Joona Kytöniemi, Jousia Piha, Akseli Reunamo, Fedor Vitiugin, Farrokh Mehryary, Sampo Pyysalo
cs.AI
Abstract
We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks, together with an updated and expanded version of the original FIN-bench, into a single, consistently formatted collection covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to the HuggingFace Datasets format and include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise ratio, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.
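The four task-selection criteria named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions or thresholds, so everything below is an assumption made for illustration: monotonicity is taken as the Spearman rank correlation between checkpoint index and score, signal-to-noise as the spread of final scores across models divided by the average standard deviation over the last few checkpoints, non-random performance as a fixed margin above the chance level, and ordering consistency as the share of checkpoints whose model ranking matches the final ranking. All function names, the learning-curve values, and the thresholds in `keep_task` are hypothetical and do not come from the paper.

```python
from statistics import mean, pstdev


def _ranks(values):
    """Return 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def _pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def monotonicity(curve):
    """Assumed definition: Spearman correlation of checkpoint index vs. score."""
    return _pearson(_ranks(list(range(len(curve)))), _ranks(curve))


def signal_to_noise(curves, tail=3):
    """Assumed definition: spread of final scores over mean late-training std."""
    finals = [c[-1] for c in curves.values()]
    signal = max(finals) - min(finals)
    noise = mean(pstdev(c[-tail:]) for c in curves.values())
    return signal / noise if noise else float("inf")


def above_random(curves, chance, margin=0.02):
    """Assumed definition: every model ends above chance by a fixed margin."""
    return all(c[-1] > chance + margin for c in curves.values())


def ordering_consistency(curves):
    """Share of checkpoints whose model ranking equals the final ranking."""
    names = list(curves)
    steps = len(next(iter(curves.values())))
    final = sorted(names, key=lambda n: curves[n][-1])
    same = sum(
        sorted(names, key=lambda n: curves[n][t]) == final for t in range(steps)
    )
    return same / steps


def keep_task(curves, chance, min_mono=0.5, min_snr=1.0, min_consistency=0.8):
    """Keep a task only if it passes all four checks (thresholds are made up)."""
    return (
        all(monotonicity(c) >= min_mono for c in curves.values())
        and signal_to_noise(curves) >= min_snr
        and above_random(curves, chance)
        and ordering_consistency(curves) >= min_consistency
    )


# Hypothetical accuracy curves over five checkpoints for two models
# on a four-way multiple-choice task (chance level 0.25).
stable_task = {
    "model_a": [0.26, 0.30, 0.35, 0.39, 0.41],
    "model_b": [0.25, 0.28, 0.31, 0.34, 0.36],
}
noisy_task = {
    "model_a": [0.26, 0.24, 0.27, 0.25, 0.26],
    "model_b": [0.25, 0.27, 0.24, 0.26, 0.25],
}

print(keep_task(stable_task, chance=0.25))
print(keep_task(noisy_task, chance=0.25))
```

Requiring all four checks to pass mirrors the abstract's statement that only tasks satisfying every criterion are retained. The individual metrics could reasonably be defined in other ways.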