Data Efficacy for Language Model Training

Abstract

Data is fundamental to the training of language models (LMs). Recent research has focused on data efficiency, which aims to maximize performance by selecting a minimal or optimal subset of the training data; techniques such as data filtering, sampling, and selection play a crucial role in this area. As a complement, we define data efficacy, which aims to maximize performance by optimizing the organization of the training data and remains relatively underexplored. This work introduces DELT, a general paradigm for considering data efficacy in LM training that highlights the significance of training data organization. DELT comprises three components: Data Scoring, Data Selection, and Data Ordering. For Data Scoring, we design Learnability-Quality Scoring (LQS), a new instance that considers both the learnability and the quality of each data sample from the perspective of gradient consistency. For Data Ordering, we devise Folding Ordering (FO), a novel instance that addresses issues such as model forgetting and data distribution bias. Comprehensive experiments validate data efficacy in LM training and show the following. First, various instances of DELT enhance LM performance to varying degrees without increasing the data scale or model size. Second, among these instances, the combination of LQS for data scoring and FO for data ordering achieves the most significant improvement. Finally, data efficacy can be achieved together with data efficiency by applying data selection. We therefore believe that data efficacy is a promising foundational area in LM training.
