CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages

Abstract

Large language models (LLMs) owe their impressive learning capabilities to two factors: colossal model sizes and extensive training datasets. As natural language processing has progressed, LLMs have frequently been released to the public to foster deeper investigation and applications. The training datasets for these LLMs, however, especially for recent state-of-the-art models, are often not fully disclosed. Creating training data for high-performing LLMs requires extensive cleaning and deduplication to ensure the necessary level of quality. The lack of transparency about training data has thus hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source, readily usable datasets for effectively training LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous multi-stage pipeline to achieve the best quality for model training; the stages include language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is fully released to the public on HuggingFace to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX.
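
To make the stage order named in the abstract concrete, the following is a minimal Python sketch of such a pipeline. It is an illustration only, not the authors' implementation: the `Document` type, the blocklist, the thresholds, and every function name are hypothetical, language identification is assumed to have been done upstream (the `lang` field), and exact-hash matching stands in for whatever deduplication method the dataset actually uses, which the abstract does not specify.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical blocklist and thresholds, chosen only for illustration.
BLOCKED_DOMAINS = {"spam.example", "ads.example"}
MIN_WORDS = 5            # documents with fewer words are dropped
MAX_SYMBOL_RATIO = 0.3   # max share of non-alphanumeric, non-space characters
MIN_LINE_WORDS = 3       # lines with fewer words are treated as page residue


@dataclass
class Document:
    url: str
    text: str
    lang: str  # assumed to be assigned by an upstream language identifier


def url_allowed(doc: Document) -> bool:
    """URL-based filtering: reject blocked domains and their subdomains."""
    host = urlparse(doc.url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def passes_metrics(doc: Document) -> bool:
    """Metric-based cleaning: simple length and symbol-ratio checks."""
    if len(doc.text.split()) < MIN_WORDS:
        return False
    symbols = sum(1 for ch in doc.text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(doc.text) <= MAX_SYMBOL_RATIO


def refine(doc: Document) -> Document:
    """Document refinement: drop very short lines (e.g. leftover button text)."""
    lines = [ln.strip() for ln in doc.text.splitlines()]
    kept = [ln for ln in lines if len(ln.split()) >= MIN_LINE_WORDS]
    return Document(doc.url, "\n".join(kept), doc.lang)


def fingerprint(text: str) -> str:
    """Hash of case- and whitespace-normalized text, for exact deduplication."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def clean_corpus(docs: list[Document], target_lang: str) -> list[Document]:
    """Apply the stages in the order the abstract lists them."""
    seen: set[str] = set()
    kept: list[Document] = []
    for doc in docs:
        if doc.lang != target_lang:      # language identification (upstream)
            continue
        if not url_allowed(doc):         # URL-based filtering
            continue
        if not passes_metrics(doc):      # metric-based cleaning
            continue
        doc = refine(doc)                # document refinement
        if not doc.text:
            continue
        key = fingerprint(doc.text)      # deduplication
        if key in seen:
            continue
        seen.add(key)
        kept.append(doc)
    return kept


if __name__ == "__main__":
    sample = [
        Document("https://news.example/a",
                 "The quick brown fox jumps over the lazy dog.\nShare", "en"),
        Document("https://mirror.example/a",
                 "The quick  brown fox jumps over the lazy dog.", "en"),
        Document("https://spam.example/x",
                 "Buy cheap things now from our great store today.", "en"),
    ]
    for doc in clean_corpus(sample, "en"):
        print(doc.url, "|", doc.text)
```

Running the filters before refinement and deduplication keeps the cheaper checks first, so fewer documents reach the hashing step. A pipeline at the scale described here would more plausibly rely on a trained language identifier and near-duplicate detection, both of which are outside the scope of this sketch.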
