

MoDE: CLIP Data Experts via Clustering

April 24, 2024
Authors: Jiawei Ma, Po-Yao Huang, Saining Xie, Shang-Wen Li, Luke Zettlemoyer, Shih-Fu Chang, Wen-Tau Yih, Hu Xu
cs.AI

Abstract

The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false-negative noise from other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 models of OpenAI CLIP and OpenCLIP on zero-shot image classification at less than 35% of the training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode.
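
The abstract describes the inference-time behavior: each data expert's output is weighted by the correlation between the task metadata (e.g., class-name embeddings) and the cluster condition the expert was trained on. Below is a minimal sketch of that ensembling step, not the authors' implementation; the array names (`image_feats`, `class_feats`, `cluster_centers`, `expert_text_feats`) and the averaging of the correlation over classes are assumptions made for illustration.

```python
# Hypothetical sketch of MoDE-style expert ensembling for zero-shot classification.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mode_zero_shot_logits(image_feats, class_feats, cluster_centers, expert_text_feats):
    """Combine the outputs of several CLIP data experts.

    image_feats:       (num_experts, N, D) image embeddings from each expert
    class_feats:       (C, D) embeddings of the class names (task metadata)
    cluster_centers:   (num_experts, D) center of the cluster each expert was trained on
    expert_text_feats: (num_experts, C, D) class-name embeddings from each expert
    """
    # Correlation between task metadata and each expert's cluster condition,
    # averaged over classes and normalized into ensemble weights (assumption).
    corr = class_feats @ cluster_centers.T           # (C, num_experts)
    weights = softmax(corr.mean(axis=0))             # (num_experts,)

    # Per-expert zero-shot logits, combined with the ensemble weights.
    logits = np.stack([img @ txt.T for img, txt in zip(image_feats, expert_text_feats)])
    return np.einsum("e,enc->nc", weights, logits)   # (N, C)
```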

