Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data

Abstract

Current trends in pre-training capable Large Language Models (LLMs) mostly focus on scaling model and dataset size. However, the quality of pre-training data, although an important factor in training powerful LLMs, remains a nebulous concept that has not been fully characterized. We therefore use the recently proposed Task2Vec diversity coefficient to ground and understand formal aspects of data quality and to go beyond scale alone. Specifically, we measure the diversity coefficient of publicly available pre-training datasets and demonstrate that their formal diversity is high compared to theoretical lower and upper bounds. In addition, to build confidence in the diversity coefficient, we conduct interpretability experiments and find that the coefficient aligns with intuitive properties of diversity, e.g., it increases as the number of latent concepts increases. We conclude that the diversity coefficient is reliable, show that it is high for publicly available LLM datasets, and conjecture that it can be used to build useful diverse datasets for LLMs.
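To make the idea of a diversity coefficient concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not define the coefficient, so the sketch assumes it is the average pairwise cosine distance between embeddings of data batches. The function names and the hand-written vectors are hypothetical placeholders; real Task2Vec embeddings would come from a probe network, a step that is omitted here.

```python
import itertools
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def diversity_coefficient(batch_embeddings):
    """Mean cosine distance over all unordered pairs of batch embeddings.

    Assumption: each element of batch_embeddings is one vector summarizing
    one batch of data (a stand-in for a Task2Vec embedding).
    """
    pairs = list(itertools.combinations(batch_embeddings, 2))
    if not pairs:
        raise ValueError("need at least two batch embeddings")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Hypothetical placeholder embeddings, not derived from any real dataset.
# Every batch reflects the same single latent concept:
one_concept = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
# Each batch reflects a different latent concept:
three_concepts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(diversity_coefficient(one_concept))
print(diversity_coefficient(three_concepts))
```

Under this assumed definition, identical batch embeddings give a coefficient of 0 and mutually orthogonal ones give 1, which mirrors the intuitive property stated above: the value grows as batches cover more distinct latent concepts.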
