CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
September 17, 2023
Authors: Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen
cs.AI
Abstract
The driving factors behind the development of large language models (LLMs)
with impressive learning capabilities are their colossal model sizes and
extensive training datasets. Along with the progress in natural language
processing, LLMs have been frequently made accessible to the public to foster
deeper investigation and applications. However, the training datasets for these
LLMs, especially for recent state-of-the-art models, are often not fully
disclosed. Creating training data for high-performing LLMs
involves extensive cleaning and deduplication to ensure the necessary level of
quality. The lack of transparency for training data has thus hampered research
on attributing and addressing hallucination and bias issues in LLMs, hindering
replication efforts and further advancements in the community. These challenges
become even more pronounced in multilingual learning scenarios, where the
available multilingual text datasets are often inadequately collected and
cleaned. Consequently, there is a lack of open-source, readily usable
datasets to effectively train LLMs in multiple languages. To overcome this
issue, we present CulturaX, a substantial multilingual dataset with 6.3
trillion tokens in 167 languages, tailored for LLM development. Our dataset
undergoes meticulous cleaning and deduplication through a rigorous pipeline of
multiple stages to achieve the best quality for model training, including
language identification, URL-based filtering, metric-based cleaning, document
refinement, and data deduplication. CulturaX is fully released to the public on
HuggingFace to facilitate research and advancements in multilingual LLMs:
https://huggingface.co/datasets/uonlp/CulturaX.
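The abstract names the pipeline stages without detailing them. As an illustration only, the Python sketch below shows the general shape of a metric-based cleaning filter of the kind described; the metric names and thresholds here are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: a metric-based document filter in the spirit of
# the pipeline described above. Metric names and thresholds are hypothetical;
# the paper's actual per-language metrics and thresholds are not given here.
from dataclasses import dataclass


@dataclass
class Thresholds:
    min_words: int = 50           # drop very short documents
    max_avg_word_len: float = 15.0  # long "words" often signal boilerplate
    max_digit_ratio: float = 0.3  # drop documents dominated by numbers


def passes_metric_filters(text: str, t: Thresholds = Thresholds()) -> bool:
    """Return True if the document survives the (hypothetical) metric checks."""
    words = text.split()
    if len(words) < t.min_words:
        return False
    avg_len = sum(len(w) for w in words) / len(words)
    if avg_len > t.max_avg_word_len:
        return False
    digits = sum(ch.isdigit() for ch in text)
    if digits / max(len(text), 1) > t.max_digit_ratio:
        return False
    return True
```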
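As a usage note: since the dataset is hosted on HuggingFace, it can presumably be loaded per language with the standard `datasets` library. The snippet below is a minimal sketch assuming the usual `load_dataset` API, an `"en"` language configuration, and a `text` field per record; streaming avoids downloading the multi-terabyte corpus up front.

```python
# Minimal sketch: stream one language configuration of CulturaX.
# Assumes the standard `datasets` API and a per-language config layout;
# the "en" config name and "text" field are assumptions from typical
# HuggingFace dataset conventions, not confirmed by the abstract.
from datasets import load_dataset

# streaming=True iterates over records without downloading the full corpus.
dataset = load_dataset("uonlp/CulturaX", "en", split="train", streaming=True)

# Inspect the first few documents.
for i, example in enumerate(dataset):
    print(example["text"][:200])
    if i >= 2:
        break
```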