
OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

February 13, 2026
Authors: Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen, Egil Rønningstad, Yves Scherrer
cs.AI

Abstract

Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work, we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available at https://huggingface.co/HPLT/OpenLID-v3.
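
The precision/coverage trade-off mentioned in the abstract can be illustrated with a minimal sketch. It assumes an agreement-based ensemble, in which a label is kept only when two classifiers agree; the abstract does not say which ensemble scheme the authors use, so this is an illustration rather than their method. The two toy classifiers, their predictions, and the function names are hypothetical stand-ins and do not reflect the behavior of OpenLID-v3 or GlotLID.

```python
from typing import Callable, Optional

# A classifier maps a text to a language label.
Classifier = Callable[[str], str]


def agreement_ensemble(text: str, first: Classifier, second: Classifier) -> Optional[str]:
    """Return a label only if both classifiers agree; otherwise abstain (None)."""
    label_a = first(text)
    label_b = second(text)
    return label_a if label_a == label_b else None


def coverage(texts: list[str], first: Classifier, second: Classifier) -> float:
    """Fraction of texts for which the ensemble returns a label."""
    labeled = sum(agreement_ensemble(t, first, second) is not None for t in texts)
    return labeled / len(texts)


# Hypothetical toy predictions (made up for illustration, not real model output).
# The two "models" disagree on one text from a closely related language pair.
TOY_A = {"text-1": "hrv", "text-2": "dan", "text-3": "oci"}
TOY_B = {"text-1": "bos", "text-2": "dan", "text-3": "oci"}


def toy_model_a(text: str) -> str:
    return TOY_A[text]


def toy_model_b(text: str) -> str:
    return TOY_B[text]


texts = ["text-1", "text-2", "text-3"]
ensemble_coverage = coverage(texts, toy_model_a, toy_model_b)
```

In this sketch the disagreement on `text-1` makes the ensemble abstain, so that text is dropped from every language-specific subset. Labels that survive are backed by both classifiers, which is why precision can rise, but texts in languages the classifiers often confuse are the ones most likely to be discarded, which lowers coverage.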