FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition

December 15, 2025
Authors: Jonas Golde, Patrick Haller, Alan Akbik
cs.AI

Abstract

Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as systematic, reusable resources. We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts. Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotates them with multilingual LLMs, resulting in about 225k passages with 235k distinct entity labels. Our experiments show that the regression model achieves more than 84 F1, and that models trained on FiNERweb obtain comparable or improved performance in zero-shot transfer settings on English, Thai, and Swahili, despite being trained on 19x less data than strong baselines. In addition, we assess annotation quality using LLM-as-a-judge and observe consistently high scores for both faithfulness (3.99 out of 5) and completeness (4.05 out of 5), indicating reliable and informative annotations. Because the performance of current state-of-the-art models drops by 0.02 to 0.09 F1 when they are evaluated with target-language labels instead of English ones, we release the dataset with both English labels and translated label sets in the respective target languages. We release FiNERweb together with all accompanying artifacts to the research community in order to facilitate more effective teacher-student training for multilingual named entity recognition.
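
The two-stage idea described in the abstract (score passages for NER relevance, then annotate the ones that pass) can be illustrated with a minimal sketch. This is not the authors' implementation: `build_dataset`, `score_fn`, `annotate_fn`, the `threshold` value, and the toy stand-ins below are all hypothetical names and choices made for illustration. In the actual pipeline the scorer would be a trained regression model and the annotator a multilingual LLM.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class AnnotatedPassage:
    text: str
    entities: List[dict]  # e.g. {"text": "Paris", "label": "entity"}


def build_dataset(
    passages: Iterable[str],
    score_fn: Callable[[str], float],          # hypothetical relevance scorer
    annotate_fn: Callable[[str], List[dict]],  # hypothetical annotator
    threshold: float = 0.5,                    # assumed cutoff, not from the paper
) -> List[AnnotatedPassage]:
    """Keep passages scored at or above `threshold`, then annotate them."""
    dataset = []
    for text in passages:
        if score_fn(text) >= threshold:
            dataset.append(AnnotatedPassage(text, annotate_fn(text)))
    return dataset


# Toy stand-ins so the sketch runs without any model.
def toy_score(text: str) -> float:
    """Fraction of capitalized words, a crude proxy for entity density."""
    words = text.split()
    if not words:
        return 0.0
    return sum(w[0].isupper() for w in words) / len(words)


def toy_annotate(text: str) -> List[dict]:
    """Tag every capitalized word as a generic entity."""
    return [
        {"text": w.strip(".,"), "label": "entity"}
        for w in text.split()
        if w[0].isupper()
    ]


if __name__ == "__main__":
    sample = ["Marie Curie worked in Paris.", "the weather was nice today."]
    for item in build_dataset(sample, toy_score, toy_annotate):
        print(item.text, item.entities)
```

Separating the scorer from the annotator keeps the expensive annotation step limited to passages that are likely to contain entities, which is the motivation the abstract gives for filtering first.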