What's In My Big Data?

Abstract

Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities, count and search, at scale, which allows us to analyze more than 35 terabytes on a standard compute node. We apply WIMBD to ten different corpora used to train popular language models, including C4, The Pile, and RedPajama. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. For instance, we find that about 50% of the documents in RedPajama and LAION-2B-en are duplicates. In addition, several datasets used for benchmarking models trained on such corpora are contaminated with respect to important benchmarks, including the Winograd Schema Challenge and parts of GLUE and SuperGLUE. We open-source WIMBD's code and artifacts to provide a standard set of evaluations for new text-based corpora and to encourage more analyses and transparency around them: github.com/allenai/wimbd.
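To make the two basic capabilities concrete, here is a minimal in-memory sketch of "count" (measuring the share of exact-duplicate documents) and "search" (finding documents that contain a string). It is an illustration only and not WIMBD's implementation: the function names `doc_hash`, `duplicate_fraction`, and `search` are hypothetical, duplicates are assumed to mean exact string matches, and a corpus of many terabytes would need a scalable counting and indexing approach rather than a Python list.

```python
import hashlib
from collections import Counter


def doc_hash(text: str) -> str:
    """Hash a document's text (assumption: duplicates are exact string matches)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def duplicate_fraction(docs) -> float:
    """Return the fraction of documents whose exact text occurs more than once."""
    docs = list(docs)
    if not docs:
        return 0.0
    # "Count": tally how many documents share each hash.
    counts = Counter(doc_hash(d) for d in docs)
    # Every document in a cluster of size > 1 is counted as a duplicate.
    in_clusters = sum(c for c in counts.values() if c > 1)
    return in_clusters / len(docs)


def search(docs, query: str):
    """"Search": return indices of documents containing the query (linear scan)."""
    return [i for i, d in enumerate(docs) if query in d]
```

The same `search` helper hints at how a contamination check could work under these assumptions: look up a benchmark example's text in the training documents and flag any hits.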