MoDE: CLIP Data Experts via Clustering

Abstract

The success of contrastive language-image pretraining (CLIP) relies on supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster and is therefore less sensitive to false-negative noise from other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 models from OpenAI CLIP and OpenCLIP on zero-shot image classification, but at a lower (<35%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode.
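The inference-time ensembling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the max-over-centers and mean-over-task aggregation, and the softmax temperature are all hypothetical choices made for illustration. It assumes task metadata (for example, class names) and fine-grained cluster centers are already embedded as vectors in the same space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expert_weights(task_embeddings, expert_centers, temperature=0.1):
    """Weight each data expert by how well its fine-grained centers match the task.

    task_embeddings: list of vectors, one per task metadata item (assumed input).
    expert_centers: one list of fine-grained center vectors per expert.
    The aggregation rule below is an assumption, not the paper's exact rule.
    """
    scores = []
    for centers in expert_centers:
        # Best-matching fine-grained center for each task item, averaged over the task.
        best = [max(cosine(t, c) for c in centers) for t in task_embeddings]
        scores.append(sum(best) / len(best))
    return softmax(scores, temperature)


def ensemble_logits(per_expert_logits, weights):
    """Weighted sum of the experts' per-class logits."""
    num_classes = len(per_expert_logits[0])
    return [
        sum(w * logits[i] for w, logits in zip(weights, per_expert_logits))
        for i in range(num_classes)
    ]


# Toy usage: two experts, a task that lies close to the first expert's centers.
task = [[1.0, 0.0]]
centers = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0]]]
weights = expert_weights(task, centers)
combined = ensemble_logits([[2.0, 0.0], [0.0, 2.0]], weights)
```

In this toy setup the first expert receives almost all of the weight, so the combined logits follow its prediction. A lower temperature makes the routing closer to picking a single expert, while a higher one averages the experts more evenly.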
