

SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification

October 7, 2024
Authors: Benjamin Feuer, Jiawei Xu, Niv Cohen, Patrick Yubeaton, Govind Mittal, Chinmay Hegde
cs.AI

Abstract

Data curation is the problem of how to collect and organize samples into a dataset that supports efficient learning. Despite the centrality of the task, little work has been devoted to a large-scale, systematic comparison of curation methods. In this work, we take steps towards a formal evaluation of data curation strategies and introduce SELECT, the first large-scale benchmark of curation strategies for image classification. To generate baseline methods for the SELECT benchmark, we create a new dataset, ImageNet++, which constitutes the largest superset of ImageNet-1K to date. Our dataset extends ImageNet with 5 new training-data shifts, each approximately the size of ImageNet-1K itself, and each assembled using a distinct curation strategy. We evaluate our data curation baselines in two ways: (i) using each training-data shift to train identical image classification models from scratch, and (ii) using the data itself to fit a pretrained self-supervised representation. Our findings show interesting trends, particularly pertaining to recent methods for data curation such as synthetic data generation and lookup based on CLIP embeddings. We show that although these strategies are highly competitive for certain tasks, the curation strategy used to assemble the original ImageNet-1K dataset remains the gold standard. We anticipate that our benchmark can illuminate the path for new methods to further reduce the gap. We release our checkpoints, code, documentation, and a link to our dataset at https://github.com/jimmyxu123/SELECT.

