What's In My Big Data?

Abstract

Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node. We apply WIMBD to ten different corpora used to train popular language models, including C4, The Pile, and RedPajama. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. For instance, we find that about 50% of the documents in RedPajama and LAION-2B-en are duplicates. In addition, several datasets used for benchmarking models trained on such corpora are contaminated with respect to important benchmarks, including the Winograd Schema Challenge and parts of GLUE and SuperGLUE. We open-source WIMBD's code and artifacts to provide a standard set of evaluations for new text-based corpora and to encourage more analyses and transparency around them: github.com/allenai/wimbd.
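
The two capabilities named above, count and search, can be illustrated on a toy scale. The sketch below is not WIMBD's implementation: the function names, the choice of SHA-256 for exact-match hashing, and the definition of "duplicate fraction" (the share of documents whose exact text occurs more than once) are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def doc_hash(text: str) -> str:
    """Hash a document's exact text (hypothetical helper, exact match only)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def duplicate_fraction(docs):
    """Count: share of documents whose exact text appears more than once.

    Assumption: every member of a duplicate cluster is counted,
    including the first occurrence.
    """
    counts = Counter(doc_hash(d) for d in docs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / total


def search(docs, query):
    """Search: indices of documents containing the query string."""
    return [i for i, d in enumerate(docs) if query in d]


# Toy corpus, invented for this example.
toy_corpus = [
    "the cat sat",
    "the cat sat",
    "a dog barked",
    "the trophy did not fit",
]

print(duplicate_fraction(toy_corpus))
print(search(toy_corpus, "trophy"))
```

A linear scan like this would not reach the tens of terabytes the abstract describes, which presumably calls for distributed counting and an index. The sketch only shows the shape of the two primitives, and how a substring search for benchmark text could flag possible contamination.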