What's In My Big Data?

Abstract

Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node. We apply WIMBD to ten different corpora used to train popular language models, including C4, The Pile, and RedPajama. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. For instance, we find that about 50% of the documents in RedPajama and LAION-2B-en are duplicates. In addition, several datasets used for benchmarking models trained on such corpora are contaminated with respect to important benchmarks, including the Winograd Schema Challenge and parts of GLUE and SuperGLUE. We open-source WIMBD's code and artifacts to provide a standard set of evaluations for new text-based corpora and to encourage more analyses and transparency around them: github.com/allenai/wimbd.
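
To make the two basic capabilities named above concrete, here is a minimal in-memory sketch of "count" (exact-duplicate detection by hashing) and "search" (substring matching). It is not WIMBD's code or API: the function names `duplicate_fraction` and `count_matches` are hypothetical, and it assumes that a "duplicate" is any document whose exact text occurs more than once, which may differ from the definition used in the paper. It also ignores the scale concerns that the actual platform addresses.

```python
import hashlib
from collections import Counter


def duplicate_fraction(documents):
    """Fraction of documents whose exact text occurs more than once.

    Assumption: a document counts as a duplicate if at least one other
    document has byte-identical text. Hypothetical helper, not WIMBD's API.
    """
    # Hash each document so that only fixed-size digests are kept in memory.
    hashes = [hashlib.sha256(doc.encode("utf-8")).hexdigest() for doc in documents]
    if not hashes:
        return 0.0
    counts = Counter(hashes)
    # Count every document that belongs to a group of two or more copies.
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(hashes)


def count_matches(documents, query):
    """Number of documents containing `query` as a substring (naive search)."""
    return sum(1 for doc in documents if query in doc)


if __name__ == "__main__":
    corpus = ["a b", "a b", "c", "d"]
    print(duplicate_fraction(corpus))  # two of the four documents share text
    print(count_matches(corpus, "a"))
```

A count of this kind underlies duplicate statistics, while a search of this kind is the basic operation behind checking whether benchmark examples appear in a corpus. At the scale described in the abstract, both would need streaming or indexed implementations rather than Python lists.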