Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
May 24, 2024
作者: Huy V. Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, Hervé Jégou, Patrick Labatut, Piotr Bojanowski
cs.AI
Abstract
Self-supervised features are the cornerstone of modern machine learning
systems. They are typically pre-trained on data collections whose construction
and curation require extensive human effort. This manual process has some
limitations similar to those encountered in supervised learning, e.g., the
crowd-sourced selection of data is costly and time-consuming, which prevents
scaling up the dataset size. In this work, we consider the problem of automatic
curation of high-quality datasets for self-supervised pre-training. We posit
that such datasets should be large, diverse, and balanced, and propose a
clustering-based approach for building datasets that satisfy all these criteria. Our
method involves successive and hierarchical applications of k-means on a
large and diverse data repository to obtain clusters that distribute uniformly
among data concepts, followed by a hierarchical, balanced sampling step from
these clusters. Extensive experiments on three different data domains including
web-based images, satellite images and text show that features trained on our
automatically curated datasets outperform those trained on uncurated data, while
being on par with or better than features trained on manually curated data.
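A minimal sketch of the two-level idea described in the abstract (hierarchical k-means followed by balanced sampling) is given below. It uses scikit-learn's KMeans on synthetic embeddings; the number of levels, the cluster counts and the per-cluster budget are illustrative placeholders, not the paper's actual pipeline or hyper-parameters.

import numpy as np
from sklearn.cluster import KMeans

# Stand-in for pre-computed self-supervised embeddings of an uncurated data pool.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 64))

# Level 1: fine-grained k-means over the raw embeddings (cluster count is illustrative).
k_fine = 100
km_fine = KMeans(n_clusters=k_fine, n_init=10, random_state=0).fit(embeddings)

# Level 2: k-means over the level-1 centroids, grouping fine clusters into coarse "concepts".
k_coarse = 10
km_coarse = KMeans(n_clusters=k_coarse, n_init=10, random_state=0).fit(km_fine.cluster_centers_)

fine_to_coarse = km_coarse.labels_   # fine-cluster index  -> coarse-cluster index
point_to_fine = km_fine.labels_      # data-point index    -> fine-cluster index

# Balanced sampling: equal budget per coarse cluster, split evenly over its fine clusters.
budget_per_coarse = 200
curated_indices = []
for coarse_id in range(k_coarse):
    fine_ids = np.flatnonzero(fine_to_coarse == coarse_id)
    per_fine = max(1, budget_per_coarse // len(fine_ids))
    for fine_id in fine_ids:
        members = np.flatnonzero(point_to_fine == fine_id)
        take = min(per_fine, len(members))
        curated_indices.append(rng.choice(members, size=take, replace=False))
curated_indices = np.concatenate(curated_indices)
print("curated subset size:", curated_indices.size)

This sketch only mirrors the cluster-then-resample structure; the paper applies the procedure at much larger scale and depth on real feature embeddings.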