Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Abstract

Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation require extensive human effort. This manual process has limitations similar to those encountered in supervised learning: for example, crowd-sourced selection of data is costly and time-consuming, which prevents scaling the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building datasets that satisfy all of these criteria. Our method involves successive and hierarchical applications of k-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on three different data domains (web-based images, satellite images and text) show that features trained on our automatically curated datasets outperform those trained on uncurated data, while being on par with or better than those trained on manually curated data.
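
The two stages described in the abstract (hierarchical k-means, then hierarchical balanced sampling) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes only two levels, a plain Lloyd's k-means written with the standard library, and an equal per-cluster quota as the balancing rule. The names `kmeans`, `hierarchical_kmeans`, `balanced_sample`, and the parameters `k1`, `k2`, `per_top` are hypothetical and chosen for illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # requires k <= len(points)
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Final assignment so labels match the returned centroids.
    return centroids, [nearest(p, centroids) for p in points]

def hierarchical_kmeans(points, k1, k2, seed=0):
    """Two-level hierarchy (an assumption): cluster the points into k1 fine
    clusters, then cluster the k1 centroids into k2 coarse clusters.
    Returns tree[coarse_id][fine_id] -> list of point indices."""
    cents1, assign1 = kmeans(points, k1, seed=seed)
    _, assign2 = kmeans(cents1, k2, seed=seed)
    tree = {}
    for idx, fine in enumerate(assign1):
        tree.setdefault(assign2[fine], {}).setdefault(fine, []).append(idx)
    return tree

def balanced_sample(tree, per_top, seed=0):
    """Illustrative balancing rule (an assumption): give each coarse cluster
    the same budget, split it evenly over its fine clusters, and draw
    without replacement, capped by each fine cluster's size."""
    rng = random.Random(seed)
    chosen = []
    for top in sorted(tree):
        children = tree[top]
        quota = max(1, per_top // len(children))
        for fine in sorted(children):
            members = children[fine]
            chosen.extend(rng.sample(members, min(quota, len(members))))
    return chosen

# Toy imbalanced pool: one large blob and one small, distant blob.
gen = random.Random(0)
pool = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(300)]
pool += [(gen.gauss(10, 1), gen.gauss(10, 1)) for _ in range(20)]

tree = hierarchical_kmeans(pool, k1=8, k2=2)
subset = balanced_sample(tree, per_top=10)
print(len(pool), "points in pool,", len(subset), "points selected")
```

Because the budget is fixed per coarse cluster rather than proportional to cluster size, a concept that is rare in the pool can receive a larger share of the selected subset than it has in the raw data. At realistic scale one would likely replace the pure-Python k-means with an optimized library implementation.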
