MoDE: CLIP Data Experts via Clustering

Abstract

The success of contrastive language-image pretraining (CLIP) relies on the supervision provided by image-caption pairs, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE), which learns a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false-negative noise from other clusters. At inference time, we ensemble the experts' outputs using weights determined by the correlation between task metadata and cluster conditions. To estimate this correlation precisely, the samples in each cluster should be semantically similar, yet the number of data experts must remain reasonable for training and inference. We therefore consider the ontology of human language and propose using fine-grained cluster centers to represent each data expert at a coarse-grained level. Experiments show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 models of OpenAI CLIP and OpenCLIP on zero-shot image classification at a lower training cost (less than 35%). MoDE can also train all data experts asynchronously and can flexibly incorporate new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode.
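The inference-time ensembling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`expert_weights`, `ensemble_logits`), the use of cosine similarity as the "correlation", the max over each expert's fine-grained centers, and the softmax temperature are all assumptions made for illustration. The sketch only shows the general idea that each expert is represented by several fine-grained cluster centers and that the experts' outputs are combined with task-dependent weights.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expert_weights(metadata_emb, expert_centers, temperature=0.1):
    """Return one weight per expert (hypothetical helper).

    Assumption: an expert is scored by the similarity between the task
    metadata embedding and its best-matching fine-grained cluster center.
    """
    scores = [
        max(cosine(metadata_emb, center) for center in centers)
        for centers in expert_centers
    ]
    return softmax(scores, temperature)


def ensemble_logits(expert_logits, weights):
    """Weighted sum of per-expert class logits (hypothetical helper)."""
    n_classes = len(expert_logits[0])
    return [
        sum(w * logits[k] for w, logits in zip(weights, expert_logits))
        for k in range(n_classes)
    ]


# Toy data: two experts, each represented by two fine-grained centers.
metadata = [1.0, 0.0]  # stand-in for an embedding of task metadata
expert_centers = [
    [[0.9, 0.1], [0.8, 0.3]],  # expert 0: centers close to the metadata
    [[0.0, 1.0], [0.1, 0.9]],  # expert 1: centers far from the metadata
]
expert_logits = [
    [2.0, 0.5],  # expert 0's logits for two classes
    [0.2, 1.5],  # expert 1's logits for two classes
]

weights = expert_weights(metadata, expert_centers)
combined = ensemble_logits(expert_logits, weights)
```

In this toy setup the first expert's centers lie much closer to the metadata embedding, so it receives nearly all of the weight and the combined logits follow its prediction. A lower temperature makes the routing sharper; a higher one averages the experts more evenly.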
