

CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages

September 17, 2023
Authors: Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen
cs.AI

Abstract

The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural language processing, LLMs have frequently been made accessible to the public to foster deeper investigation and applications. However, the training datasets for these LLMs, especially the recent state-of-the-art models, are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. This lack of transparency around training data has hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancement in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source, readily usable datasets for effectively training LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous multi-stage pipeline to achieve the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is fully released to the public on HuggingFace to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX.
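As a loose illustration of what the "metric-based cleaning" stage of such a pipeline can look like, the sketch below rejects documents that are too short or dominated by markup-like symbols. The thresholds and the symbol set here are invented for illustration; the paper tunes its filtering metrics per language rather than using fixed values like these.

```python
import re

# Hypothetical thresholds, for illustration only; CulturaX derives its
# metric-based filters per language rather than using fixed values.
MIN_WORDS = 50
MAX_SYMBOL_RATIO = 0.10

def passes_metric_filters(text: str) -> bool:
    """Toy metric-based cleaning: reject documents that are too short
    or whose character stream is dominated by markup-like symbols."""
    words = text.split()
    if len(words) < MIN_WORDS:
        return False
    symbols = len(re.findall(r"[#{}<>\[\]|\\^~]", text))
    return symbols / max(len(text), 1) <= MAX_SYMBOL_RATIO
```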
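Because the corpus is published as per-language configurations on the HuggingFace Hub, it can be loaded with the standard `datasets` library. The minimal sketch below streams a few English documents rather than downloading the full split; the configuration name `"en"` and the `text` field follow the dataset card, and access to the dataset may require authenticating with the Hub.

```python
from datasets import load_dataset

# Stream the English configuration of CulturaX; streaming avoids
# downloading the full multi-terabyte split up front.
ds = load_dataset("uonlp/CulturaX", "en", split="train", streaming=True)

# Inspect the first few documents. Per the dataset card, each record
# carries the cleaned text plus crawl metadata (url, timestamp, source).
for i, doc in enumerate(ds):
    print(doc["text"][:200].replace("\n", " "))
    if i == 2:
        break
```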