Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Abstract

High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like Fineweb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.
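
To make the distillation idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that document embeddings from a frozen multilingual encoder are already available (random vectors stand in for them here), that an LLM has assigned each document a numeric quality score (synthetic values here), and that the lightweight annotator is a plain linear regression head. The names `train_head`, `predict`, `mse`, and `THRESHOLD` are hypothetical and chosen only for illustration.

```python
import random

def predict(weights, bias, embedding):
    """Score one document: dot product of head weights and embedding, plus bias."""
    return sum(w * x for w, x in zip(weights, embedding)) + bias

def mse(weights, bias, embeddings, scores):
    """Mean squared error between head predictions and teacher (LLM) scores."""
    return sum((predict(weights, bias, e) - s) ** 2
               for e, s in zip(embeddings, scores)) / len(scores)

def train_head(embeddings, scores, epochs=200, lr=0.05):
    """Fit a linear head with full-batch gradient descent on the MSE loss."""
    dim, n = len(embeddings[0]), len(scores)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for e, s in zip(embeddings, scores):
            err = predict(weights, bias, e) - s
            for j in range(dim):
                grad_w[j] += 2 * err * e[j] / n
            grad_b += 2 * err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

# Synthetic stand-ins (assumptions): real embeddings would come from a
# pretrained multilingual encoder, real scores from an LLM annotator.
rng = random.Random(0)
DIM = 4
hidden = [rng.uniform(-1, 1) for _ in range(DIM)]
embeddings = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(200)]
teacher_scores = [predict(hidden, 2.5, e) + rng.gauss(0, 0.1) for e in embeddings]

loss_before = mse([0.0] * DIM, 0.0, embeddings, teacher_scores)
weights, bias = train_head(embeddings, teacher_scores)
loss_after = mse(weights, bias, embeddings, teacher_scores)

# Keep only documents whose predicted quality reaches a hypothetical cutoff.
THRESHOLD = 2.5
kept = [i for i, e in enumerate(embeddings)
        if predict(weights, bias, e) >= THRESHOLD]
print(f"loss {loss_before:.3f} -> {loss_after:.3f}, kept {len(kept)}/{len(embeddings)}")
```

The appeal of this design is that the expensive LLM is only needed to label a training sample. Once the head is fitted, scoring a new document costs one embedding pass plus a dot product, which is what would make filtering at corpus scale affordable under these assumptions.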
