What's In My Big Data?

Abstract

Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node. We apply WIMBD to ten different corpora used to train popular language models, including C4, The Pile, and RedPajama. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. For instance, we find that about 50% of the documents in RedPajama and LAION-2B-en are duplicates. In addition, several datasets used for benchmarking models trained on such corpora are contaminated with respect to important benchmarks, including the Winograd Schema Challenge and parts of GLUE and SuperGLUE. We open-source WIMBD's code and artifacts to provide a standard set of evaluations for new text-based corpora and to encourage more analyses and transparency around them: github.com/allenai/wimbd.
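
The two capabilities named above, count and search, can be illustrated on a toy scale. The following is a minimal sketch, not WIMBD's implementation: the function names `duplicate_fraction` and `search` are hypothetical, the corpus is assumed to fit in memory, and a document is assumed to count as a duplicate when its exact text appears more than once.

```python
import hashlib
from collections import Counter


def duplicate_fraction(documents):
    """Count: fraction of documents whose exact text occurs more than once.

    Assumption: "duplicate" means exact string match, detected by hashing.
    """
    docs = list(documents)
    if not docs:
        return 0.0
    # Hash each document so that only fixed-size digests need to be counted.
    counts = Counter(
        hashlib.sha256(doc.encode("utf-8")).hexdigest() for doc in docs
    )
    # Every document in a group of size > 1 is counted as a duplicate.
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(docs)


def search(documents, query):
    """Search: indices of documents that contain the query substring."""
    return [i for i, doc in enumerate(documents) if query in doc]


if __name__ == "__main__":
    toy_corpus = [
        "the trophy does not fit in the suitcase",
        "an unrelated sentence",
        "the trophy does not fit in the suitcase",
        "another unrelated sentence",
    ]
    print(duplicate_fraction(toy_corpus))
    print(search(toy_corpus, "trophy"))
```

At the scale described in the abstract, an in-memory counter and a linear scan would not be practical; the sketch only shows how a duplicate rate or a contamination check reduces to counting and searching.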