Demystifying CLIP Data

Abstract

Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient in the success of CLIP is its data, not the model architecture or pre-training objective. However, CLIP provides only very limited information about its data and how it was collected, which has led to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and, in our pursuit of making it open to the community, introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5% without any bells and whistles. Curation code and the training data distribution on metadata are made available at https://github.com/facebookresearch/MetaCLIP.
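The abstract describes MetaCLIP as taking a raw data pool plus metadata and yielding a subset balanced over the metadata distribution, but it does not spell out the matching rule or the balancing rule. The Python sketch below is one plausible reading, not the authors' implementation: it assumes a text matches a metadata entry by substring, and that entries with more than a hypothetical `cap` matches are down-sampled toward that cap. The function name `balance_by_metadata` and all parameters are illustrative; the released curation code in the repository above is the authoritative reference.

```python
import random
from collections import Counter


def balance_by_metadata(texts, metadata, cap, seed=0):
    """Return indices of texts kept after balancing over metadata entries.

    Assumptions (not stated in the abstract):
      * a text matches an entry if the entry occurs in it as a substring;
      * an entry with more than `cap` matches is down-sampled, so frequent
        entries contribute roughly `cap` texts while rare ones are kept.
    """
    rng = random.Random(seed)

    # Step 1: find which metadata entries each text matches.
    matches = [[e for e in metadata if e in t.lower()] for t in texts]

    # Step 2: count matches per entry across the whole pool.
    counts = Counter(e for m in matches for e in m)

    # Step 3: keep a text if any matched entry passes a draw with
    # probability min(1, cap / count). Texts with no match are dropped.
    kept = []
    for i, m in enumerate(matches):
        if any(rng.random() < min(1.0, cap / counts[e]) for e in m):
            kept.append(i)
    return kept


if __name__ == "__main__":
    pool = ["a photo of a dog"] * 50 + ["a rare axolotl", "zzz"]
    kept = balance_by_metadata(pool, ["dog", "axolotl"], cap=5)
    print(len(kept), "of", len(pool), "texts kept")
```

Under these assumptions, the frequent entry ("dog") is thinned out, the rare entry ("axolotl") always survives, and the unmatched text is discarded, which flattens the distribution over metadata entries.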

