MoDE: CLIP Data Experts via Clustering
April 24, 2024
Authors: Jiawei Ma, Po-Yao Huang, Saining Xie, Shang-Wen Li, Luke Zettlemoyer, Shih-Fu Chang, Wen-Tau Yih, Hu Xu
cs.AI
Abstract
The success of contrastive language-image pretraining (CLIP) relies on the
supervision from the pairing between images and captions, which tends to be
noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn
a system of CLIP data experts via clustering. Each data expert is trained on
one data cluster, being less sensitive to false negative noise in other
clusters. At inference time, we ensemble their outputs by applying weights
determined through the correlation between task metadata and cluster
conditions. To estimate the correlation precisely, the samples in one cluster
should be semantically similar, but the number of data experts should still be
reasonable for training and inference. As such, we consider the ontology in
human language and propose to use fine-grained cluster centers to represent
each data expert at a coarse-grained level. Experimental studies show that four
CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and
OpenCLIP on zero-shot image classification but with less (<35%) training
cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly
include new data experts. The code is available at
https://github.com/facebookresearch/MetaCLIP/tree/main/mode.
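The abstract describes data experts defined by two-level clustering: captions are clustered at a fine-grained level, and the fine-grained centers are then grouped into a few coarse clusters, one per expert. Below is a minimal sketch of that idea, not the authors' released code; the names `caption_embs`, `n_fine`, and `n_experts` are illustrative assumptions.

```python
# Minimal sketch (assumed interfaces, not the authors' implementation):
# fine-grained clustering of caption embeddings, then coarse grouping of
# the fine-grained centers so that each coarse cluster defines one data expert.
import numpy as np
from sklearn.cluster import KMeans

def build_data_experts(caption_embs: np.ndarray, n_fine: int = 1024, n_experts: int = 4):
    # Step 1: fine-grained clustering of the training captions.
    fine = KMeans(n_clusters=n_fine, n_init="auto", random_state=0).fit(caption_embs)
    fine_centers = fine.cluster_centers_              # (n_fine, d)

    # Step 2: coarse-grained clustering of the fine-grained centers.
    coarse = KMeans(n_clusters=n_experts, n_init="auto", random_state=0).fit(fine_centers)
    center_to_expert = coarse.labels_                 # (n_fine,) -> expert id

    # Route each training pair to the expert owning its fine-grained center;
    # that expert is then trained only on its own cluster of pairs.
    pair_to_expert = center_to_expert[fine.labels_]   # (n_pairs,)
    return fine_centers, center_to_expert, pair_to_expert
```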
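At inference time, the abstract says expert outputs are ensembled with weights derived from the correlation between task metadata and cluster conditions. The sketch below illustrates one plausible reading under stated assumptions: class-name embeddings stand in for task metadata, each expert is scored by its best-matching fine-grained centers, and the per-expert logits are combined with softmax weights. All array names and the scoring rule are assumptions for illustration.

```python
# Minimal sketch (assumptions, not the paper's exact ensembling rule):
# weight each data expert by how well the task metadata (class-name
# embeddings) matches the fine-grained centers assigned to that expert.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mode_ensemble(expert_logits, class_embs, fine_centers, center_to_expert, tau=0.1):
    """expert_logits: list of (n_images, n_classes) arrays, one per expert.
    class_embs: (n_classes, d) embeddings of the task metadata (class names).
    fine_centers: (n_fine, d) fine-grained cluster centers.
    center_to_expert: (n_fine,) expert id for each fine-grained center."""
    n_experts = len(expert_logits)
    # Similarity between each class and every fine-grained center.
    sim = class_embs @ fine_centers.T                 # (n_classes, n_fine)
    # Score an expert by its best-matching center, averaged over classes.
    scores = np.array([
        sim[:, center_to_expert == e].max(axis=1).mean() for e in range(n_experts)
    ])
    weights = softmax(scores / tau)                   # (n_experts,)
    # Weighted combination of the per-expert classification logits.
    return sum(w * logits for w, logits in zip(weights, expert_logits))
```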