Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data
June 24, 2023
Authors: Alycia Lee, Brando Miranda, Sanmi Koyejo
cs.AI
Abstract
Current trends to pre-train capable Large Language Models (LLMs) mostly focus on scaling of model and dataset size. However, the quality of pre-training data is an important factor for training powerful LLMs, yet it is a nebulous concept that has not been fully characterized. Therefore, we use the recently proposed Task2Vec diversity coefficient to ground and understand formal aspects of data quality, to go beyond scale alone. Specifically, we measure the diversity coefficient of publicly available pre-training datasets to demonstrate that their formal diversity is high when compared to theoretical lower and upper bounds. In addition, to build confidence in the diversity coefficient, we conduct interpretability experiments and find that the coefficient aligns with intuitive properties of diversity, e.g., it increases as the number of latent concepts increases. We conclude the diversity coefficient is reliable, show it's high for publicly available LLM datasets, and conjecture it can be used to build useful diverse datasets for LLMs.
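
For concreteness, below is a minimal Python sketch of the idea behind such a diversity coefficient: embed each sampled batch of data as a vector and report the average pairwise cosine distance between batch embeddings. The `embed_batch` helper is a hypothetical placeholder (the paper uses Task2Vec embeddings derived from the Fisher Information Matrix of a probe network); here it is stubbed with random vectors purely for illustration, and the function and variable names are not taken from the paper's code.

```python
# Hedged sketch: diversity coefficient as the expected pairwise cosine
# distance between embeddings of random batches drawn from a dataset.
import numpy as np

def embed_batch(batch, dim=2048):
    """Hypothetical stand-in for a Task2Vec-style embedding of one batch.

    In the paper this would be the diagonal of the Fisher Information
    Matrix of a probe network evaluated on the batch; here it is a
    deterministic random vector, for illustration only.
    """
    rng = np.random.default_rng(abs(hash(tuple(batch))) % (2**32))
    return rng.normal(size=dim)

def cosine_distance(u, v):
    """1 minus the cosine similarity of two embedding vectors."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def diversity_coefficient(batches):
    """Average cosine distance over all distinct pairs of batch embeddings."""
    embs = [embed_batch(b) for b in batches]
    dists = [cosine_distance(embs[i], embs[j])
             for i in range(len(embs)) for j in range(i + 1, len(embs))]
    return float(np.mean(dists))

if __name__ == "__main__":
    # Toy "batches" of documents sampled from a corpus (placeholders).
    toy_batches = [("doc_a", "doc_b"), ("doc_c", "doc_d"), ("doc_e", "doc_f")]
    print(f"diversity coefficient ~ {diversity_coefficient(toy_batches):.3f}")
```

With real Task2Vec embeddings in place of the stub, higher values of this quantity would indicate that random batches of the dataset look more dissimilar to one another, which is the intuition behind comparing public pre-training corpora against theoretical lower and upper bounds.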