GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages

Abstract

The need for large text corpora has grown with the advent of pretrained language models and, in particular, with the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source, reproducible pipeline; and (iii) is rigorously cleaned of noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general-domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it (the pipeline, the language identification model, and the filters) available to the research community. Corpus v. 1.0: https://huggingface.co/datasets/cis-lmu/GlotCC-v1, Pipeline v. 3.0: https://github.com/cisnlp/GlotCC.