Data Efficacy for Language Model Training
June 26, 2025
Authors: Yalun Dai, Yangyu Huang, Xin Zhang, Wenshan Wu, Chong Li, Wenhui Lu, Shijie Cao, Li Dong, Scarlett Li
cs.AI
Abstract
Data is fundamental to the training of language models (LMs). Recent research
has been dedicated to data efficiency, which aims to maximize performance by
selecting a minimal or optimal subset of training data. Techniques such as data
filtering, sampling, and selection play a crucial role in this area. As a
complement, we define Data Efficacy, which focuses on maximizing performance
by optimizing the organization of training data, a direction that remains
relatively underexplored. This work introduces a general paradigm, DELT, for
considering
data efficacy in LM training, which highlights the significance of training
data organization. DELT comprises three components: Data Scoring, Data
Selection, and Data Ordering. Among these components, we design
Learnability-Quality Scoring (LQS), as a new instance of Data Scoring, which
considers both the learnability and quality of each data sample from the
gradient consistency perspective. We also devise Folding Ordering (FO), as a
novel instance of Data Ordering, which addresses issues such as model
forgetting and data distribution bias. Comprehensive experiments validate
data efficacy in LM training and demonstrate the following. First, various
instances of the proposed DELT enhance LM performance to varying degrees
without increasing the data scale or model size. Second, among these
instances, the combination of our proposed LQS for data scoring and Folding for
data ordering achieves the most significant improvement. Finally, data efficacy
can be achieved together with data efficiency by applying data selection.
Therefore, we believe that data efficacy is a promising foundational area in LM
training.
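
To make the three components named above concrete, here is a minimal sketch of
a score, select, order pipeline. It is an illustration under stated
assumptions, not the authors' implementation: the abstract does not specify how
LQS is computed or how Folding Ordering arranges samples, so `score_fn` is a
hypothetical placeholder for any per-sample scorer, and `folded_order` shows
one plausible reading of "folding" (several ascending passes over the score
range built by strided sampling). All function and parameter names are
invented for this example.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_top(
    samples: Sequence[T], scores: Sequence[float], keep_ratio: float
) -> Tuple[List[T], List[float]]:
    """Data selection: keep the highest-scoring fraction of the samples."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, round(len(samples) * keep_ratio))
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])  # restore the original relative order
    return [samples[i] for i in kept], [scores[i] for i in kept]


def folded_order(
    samples: Sequence[T], scores: Sequence[float], num_folds: int
) -> List[T]:
    """Data ordering (assumed reading of 'folding', not the paper's definition).

    Sort by ascending score, then take every num_folds-th sample starting at
    offset 0, 1, ..., num_folds - 1. Each fold is itself ascending and spans
    the whole score range; the folds are concatenated.
    """
    if num_folds < 1:
        raise ValueError("num_folds must be >= 1")
    ascending = sorted(range(len(samples)), key=lambda i: scores[i])
    folded: List[int] = []
    for offset in range(num_folds):
        folded.extend(ascending[offset::num_folds])
    return [samples[i] for i in folded]


def organize(
    samples: Sequence[T],
    score_fn: Callable[[T], float],  # hypothetical stand-in for a scorer such as LQS
    keep_ratio: float = 1.0,
    num_folds: int = 1,
) -> List[T]:
    """Run scoring, then selection, then ordering, in that sequence."""
    scores = [score_fn(s) for s in samples]
    kept, kept_scores = select_top(samples, scores, keep_ratio)
    return folded_order(kept, kept_scores, num_folds)
```

With `num_folds=1` the ordering step reduces to a plain ascending sort by
score, and with `keep_ratio=1.0` the selection step keeps every sample, so the
two knobs can be varied independently when comparing orderings on a fixed
dataset.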