CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages

Abstract

The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. As natural language processing has progressed, LLMs have frequently been made accessible to the public to foster deeper investigation and applications. However, the training datasets for these LLMs, especially the recent state-of-the-art models, are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. The lack of transparency about training data has thus hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source, readily usable datasets for effectively training LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous multi-stage pipeline to achieve the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is fully released to the public on HuggingFace to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX.
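To make the pipeline stages named in the abstract more concrete, the following is a minimal Python sketch of how URL-based filtering, metric-based cleaning, document refinement, and deduplication might be chained. It is not the authors' implementation: the blocklist, the two metrics, their thresholds, and all function names (`url_allowed`, `passes_metrics`, `refine`, `clean_corpus`) are assumptions made for illustration. Exact-hash matching stands in for whatever deduplication method the dataset actually uses, and language identification is omitted because it would require a trained classifier.

```python
import hashlib
import re
from urllib.parse import urlparse

# Hypothetical blocklist; the real URL lists are not given in the abstract.
BLOCKED_DOMAINS = {"spam.example", "ads.example"}


def url_allowed(url: str) -> bool:
    """URL-based filtering: drop documents hosted on blocked domains."""
    host = urlparse(url).hostname or ""
    return host not in BLOCKED_DOMAINS


def passes_metrics(text: str, min_words: int = 5,
                   max_symbol_ratio: float = 0.3) -> bool:
    """Metric-based cleaning with two illustrative metrics.

    The thresholds are made up for this sketch, not taken from the paper.
    """
    words = text.split()
    if len(words) < min_words:
        return False
    # Count characters that are neither alphanumeric nor whitespace.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def refine(text: str) -> str:
    """Document refinement: collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(docs):
    """Run each document through the stages and keep the survivors.

    `docs` is an iterable of dicts with "url" and "text" keys
    (an assumed record layout, chosen for this example).
    """
    seen = set()
    kept = []
    for doc in docs:
        if not url_allowed(doc["url"]):
            continue
        text = refine(doc["text"])
        if not passes_metrics(text):
            continue
        # Exact deduplication on the case-folded, refined text.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append({"url": doc["url"], "text": text})
    return kept
```

Calling `clean_corpus` on a list of `{"url": ..., "text": ...}` records returns only the documents that pass every stage. Ordering the cheap URL check first avoids spending work on text that would be discarded anyway.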

