Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Abstract

Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation require extensive human effort. This manual process has limitations similar to those encountered in supervised learning: for example, crowd-sourced selection of data is costly and time-consuming, which prevents scaling up the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building datasets that satisfy all of these criteria. Our method involves successive and hierarchical applications of k-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on three different data domains (web-based images, satellite images and text) show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par with or better than those trained on manually curated data.
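The two stages named in the abstract (hierarchical k-means, then hierarchical balanced sampling) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the details, so the sketch assumes two levels, assumes that the second level clusters the centroids of the first, and assumes that the sampling budget is split equally across clusters at each level. The names `kmeans`, `curate`, `k1`, `k2` and `n` are hypothetical, and the toy data stands in for learned embeddings.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means over a list of equal-length tuples.

    Returns (centroids, labels). Requires k <= len(points).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k data points

    def nearest(p):
        # Index of the centroid closest to p in squared Euclidean distance.
        return min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))

    for _ in range(iters):
        labels = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, [nearest(p) for p in points]


def curate(points, k1, k2, n, seed=0):
    """Two-level k-means followed by balanced sampling (assumed scheme).

    Level 1 clusters the points; level 2 clusters the level-1 centroids
    (requires k2 <= k1). Returns indices into `points`, at most about n.
    """
    rng = random.Random(seed)
    cents1, lab1 = kmeans(points, k1, seed=seed)
    _, lab2 = kmeans(cents1, k2, seed=seed)

    # tree[top cluster][level-1 cluster] -> indices of the points it contains
    tree = {}
    for i, c1 in enumerate(lab1):
        tree.setdefault(lab2[c1], {}).setdefault(c1, []).append(i)

    picked = []
    per_top = n // len(tree)  # equal budget for every top-level cluster
    for children in tree.values():
        per_child = max(1, per_top // len(children))  # equal budget per child
        for idxs in children.values():
            # Small clusters contribute everything they have.
            picked.extend(rng.sample(idxs, min(per_child, len(idxs))))
    return picked


if __name__ == "__main__":
    # Toy imbalanced repository: 200 points of one "concept", 10 of another.
    data = [(0.01 * i, 0.0) for i in range(200)]
    data += [(100 + 0.01 * i, 0.0) for i in range(10)]
    chosen = curate(data, k1=2, k2=2, n=10)
    rare = sum(i >= 200 for i in chosen)
    print(rare, "of", len(chosen), "samples come from the rare concept")
```

The point of the sketch is the budget split: because each cluster receives the same share regardless of its size, a concept that is rare in the repository is represented in the curated subset far more heavily than under uniform random sampling.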
