Evaluating Sample Utility for Data Selection by Mimicking Model Weights

January 12, 2025
作者: Tzu-Heng Huang, Manjot Bilkhu, Frederic Sala, Javier Movellan
cs.AI

Abstract

Foundation models rely on large-scale web-crawled datasets, which frequently contain noisy data, biases, and irrelevant content. Existing data selection techniques typically use human heuristics, downstream evaluation datasets, or specialized scoring models, and can overlook a sample's utility in the training process. Instead, we propose a new approach, the Mimic score, a data quality metric that uses a pretrained reference model as a guide to assess the usefulness of data samples for training a new model. It relies on the alignment between the gradient of the new model parameters and the vector pointing toward the reference model in weight space. Samples that misalign with this direction are considered low-value and can be filtered out. Motivated by the Mimic score, we develop Grad-Mimic, a data selection framework that identifies and prioritizes useful samples, automating the selection process to create effective filters. Empirically, using Mimic scores to guide model training yields consistent performance gains across six image datasets and enhances the performance of CLIP models. Moreover, Mimic scores and their associated filters improve upon existing filtering methods and offer an accurate estimate of dataset quality.
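
To make the alignment idea concrete, here is a minimal, hypothetical PyTorch sketch based only on the abstract's description: each sample is scored by how well its negative gradient points toward the reference model in weight space. The function name `mimic_scores`, the cosine-similarity formulation, and the median-threshold filter are our illustrative assumptions, not the paper's exact definitions.

```python
import torch
import torch.nn.functional as F


def mimic_scores(model: torch.nn.Module,
                 ref_model: torch.nn.Module,
                 per_sample_grads: torch.Tensor) -> torch.Tensor:
    """Alignment of each sample's negative gradient with the direction
    from the current weights to the reference weights.

    per_sample_grads: (batch, num_params) tensor, one flattened gradient
    vector per sample.
    """
    # Flatten both models' parameters into single weight-space vectors.
    theta = torch.cat([p.detach().flatten() for p in model.parameters()])
    theta_ref = torch.cat([p.detach().flatten() for p in ref_model.parameters()])

    # Vector pointing from the new model toward the reference model.
    direction = theta_ref - theta

    # A gradient-descent step moves weights along -grad, so a sample whose
    # negative gradient aligns with `direction` pulls training toward the
    # reference; cosine similarity keeps the score scale-invariant (an
    # assumption; the paper defines its own alignment measure).
    return F.cosine_similarity(-per_sample_grads, direction.unsqueeze(0), dim=1)


# Toy usage: score a batch for a one-layer model, then keep high-score samples.
model = torch.nn.Linear(4, 1)
ref_model = torch.nn.Linear(4, 1)  # stands in for the pretrained reference
x, y = torch.randn(8, 4), torch.randn(8, 1)

grads = []
for i in range(len(x)):  # per-sample gradients (a loop keeps the demo simple)
    model.zero_grad()
    loss = F.mse_loss(model(x[i:i + 1]), y[i:i + 1])
    loss.backward()
    grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))

scores = mimic_scores(model, ref_model, torch.stack(grads))
keep = scores >= scores.median()  # one possible filter: drop low-score samples
```

In this reading, Grad-Mimic would aggregate such per-sample scores into reusable filters; the median threshold above is only one illustrative way to turn scores into a keep/drop decision.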
