Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
May 24, 2024
作者: Huy V. Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, Hervé Jégou, Patrick Labatut, Piotr Bojanowski
cs.AI
Abstract
Self-supervised features are the cornerstone of modern machine learning
systems. They are typically pre-trained on data collections whose construction
and curation require extensive human effort. This manual process has some
limitations similar to those encountered in supervised learning, e.g., the
crowd-sourced selection of data is costly and time-consuming, which prevents
scaling up the dataset size. In this work, we consider the problem of automatic
curation of high-quality datasets for self-supervised pre-training. We posit
that such datasets should be large, diverse, and balanced, and propose a
clustering-based approach for building datasets that satisfy all these criteria. Our
method involves successive and hierarchical applications of k-means on a
large and diverse data repository to obtain clusters that distribute uniformly
among data concepts, followed by a hierarchical, balanced sampling step from
these clusters. Extensive experiments on three different data domains including
web-based images, satellite images and text show that features trained on our
automatically curated datasets outperform those trained on uncurated data, while
being on par with or better than features trained on manually curated data.
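A minimal sketch of the two-level idea described in the abstract (hierarchical k-means followed by balanced sampling) is given below. It uses scikit-learn's KMeans on synthetic embeddings; the number of levels, the cluster counts and the per-cluster budget are illustrative placeholders, not the paper's actual pipeline or hyper-parameters.

import numpy as np
from sklearn.cluster import KMeans

# Stand-in for pre-computed self-supervised embeddings of an uncurated data pool.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 64))

# Level 1: fine-grained k-means over the raw embeddings (cluster count is illustrative).
k_fine = 100
km_fine = KMeans(n_clusters=k_fine, n_init=10, random_state=0).fit(embeddings)

# Level 2: k-means over the level-1 centroids, grouping fine clusters into coarse "concepts".
k_coarse = 10
km_coarse = KMeans(n_clusters=k_coarse, n_init=10, random_state=0).fit(km_fine.cluster_centers_)

fine_to_coarse = km_coarse.labels_   # fine-cluster index  -> coarse-cluster index
point_to_fine = km_fine.labels_      # data-point index    -> fine-cluster index

# Balanced sampling: equal budget per coarse cluster, split evenly over its fine clusters.
budget_per_coarse = 200
curated_indices = []
for coarse_id in range(k_coarse):
    fine_ids = np.flatnonzero(fine_to_coarse == coarse_id)
    per_fine = max(1, budget_per_coarse // len(fine_ids))
    for fine_id in fine_ids:
        members = np.flatnonzero(point_to_fine == fine_id)
        take = min(per_fine, len(members))
        curated_indices.append(rng.choice(members, size=take, replace=False))
curated_indices = np.concatenate(curated_indices)
print("curated subset size:", curated_indices.size)

This sketch only mirrors the cluster-then-resample structure; the paper applies the procedure at much larger scale and depth on real feature embeddings.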