Data Efficacy for Language Model Training

Abstract

Data is fundamental to the training of language models (LMs). Recent research has focused on data efficiency, which aims to maximize performance by selecting a minimal or optimal subset of the training data; techniques such as data filtering, sampling, and selection play a crucial role in this area. As a complement, we define data efficacy, which focuses on maximizing performance by optimizing the organization of training data and remains relatively underexplored. This work introduces DELT, a general paradigm for considering data efficacy in LM training, which highlights the significance of training data organization. DELT comprises three components: Data Scoring, Data Selection, and Data Ordering. Among these components, we design Learnability-Quality Scoring (LQS), a new instance of Data Scoring that considers both the learnability and the quality of each data sample from the perspective of gradient consistency. We also devise Folding Ordering (FO), a novel instance of Data Ordering that addresses issues such as model forgetting and data distribution bias. Comprehensive experiments validate data efficacy in LM training and demonstrate the following. First, various instances of the proposed DELT enhance LM performance to varying degrees without increasing the data scale or model size. Second, among these instances, the combination of our proposed LQS for data scoring and FO for data ordering achieves the most significant improvement. Finally, data efficacy can be achieved together with data efficiency by applying data selection. We therefore believe that data efficacy is a promising foundational area in LM training.
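To make the three-stage structure concrete, the following is a minimal sketch of a score, select, and order pipeline. It is not the paper's implementation: the abstract does not specify how LQS scores are computed or how Folding Ordering arranges samples, so the function name `delt_order`, the parameters `keep_ratio` and `num_folds`, the ascending sort direction, and the strided "folding" scheme are all assumptions made for illustration. The scores are simply taken as given numbers.

```python
from typing import List, Sequence


def delt_order(
    scores: Sequence[float],
    keep_ratio: float = 1.0,
    num_folds: int = 1,
) -> List[int]:
    """Return sample indices in a training order (illustrative only).

    scores     -- one precomputed score per sample (stand-in for a
                  scorer such as LQS; how it is computed is not shown)
    keep_ratio -- fraction of top-scoring samples to keep (hypothetical)
    num_folds  -- number of sorted passes to build (hypothetical)
    """
    # Data selection: keep the highest-scoring fraction of the samples.
    n_keep = max(1, int(len(scores) * keep_ratio))
    by_score_desc = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = by_score_desc[:n_keep]

    # Data ordering: sort the kept samples by ascending score
    # (the sort direction is an assumption).
    ascending = sorted(kept, key=lambda i: scores[i])

    # Assumed folding scheme: take every num_folds-th sample starting at
    # each offset, so every fold is itself sorted and spans the full
    # score range, and no sample is repeated.
    order: List[int] = []
    for fold in range(num_folds):
        order.extend(ascending[fold::num_folds])
    return order


example_scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.2]
print(delt_order(example_scores, keep_ratio=1.0, num_folds=2))
print(delt_order(example_scores, keep_ratio=0.5, num_folds=1))
```

With `num_folds=1` the sketch reduces to a plain sorted order; larger values revisit the whole score range several times, which is one plausible way an ordering could counter forgetting of early samples and a skewed distribution late in training.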
