Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

May 28, 2025
作者: Mehdi Ali, Manuel Brack, Max Lübbering, Elias Wendt, Abbas Goher Khan, Richard Rutmann, Alex Jude, Maurice Kraus, Alexander Arno Weber, Felix Stollenwerk, David Kaczér, Florian Mai, Lucie Flek, Rafet Sifa, Nicolas Flores-Herr, Joachim Köhler, Patrick Schramowski, Michael Fromm, Kristian Kersting
cs.AI

Abstract

High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering approaches such as the one used for FineWeb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.
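
To make the distillation idea concrete, below is a minimal sketch of a JQL-style lightweight annotator: a small regression head on top of frozen multilingual sentence embeddings scores each document, and a threshold on that score decides what to keep. The embedding model (`paraphrase-multilingual-MiniLM-L12-v2`), the `QualityHead` architecture, the `filter_corpus` helper, and the 0.5 threshold are illustrative assumptions for this sketch, not the paper's exact configuration.

```python
# Minimal sketch of a JQL-style lightweight annotator (illustrative only):
# a small MLP head over frozen multilingual embeddings predicts a scalar
# document-quality score, which is then thresholded to filter a corpus.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

# Any pretrained multilingual embedding model could stand in here;
# this particular checkpoint is an assumption, not the paper's choice.
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

class QualityHead(nn.Module):
    """Lightweight annotator: maps a document embedding to a scalar score."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x).squeeze(-1)

# In JQL the head would be trained to imitate quality labels produced by
# a strong LLM judge; this sketch only shows inference and filtering.
head = QualityHead(embedder.get_sentence_embedding_dimension())

def filter_corpus(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Keep documents whose predicted quality score clears the threshold."""
    with torch.no_grad():
        emb = torch.as_tensor(embedder.encode(docs))
        scores = torch.sigmoid(head(emb))  # squash scores into [0, 1]
    return [d for d, s in zip(docs, scores.tolist()) if s >= threshold]

if __name__ == "__main__":
    sample = [
        "A well-written encyclopedic paragraph about photosynthesis.",
        "buy now!!! click here $$$",
    ]
    print(filter_corpus(sample))
```

Because the head is tiny compared to an LLM judge, scoring reduces to one embedding pass plus a small matrix multiply per document, which is what makes filtering web-scale multilingual corpora computationally tractable.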