Demystifying CLIP Data
September 28, 2023
Authors: Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
cs.AI
Abstract
Contrastive Language-Image Pre-training (CLIP) is an approach that has
advanced research and applications in computer vision, fueling modern
recognition systems and generative models. We believe that the main ingredient
to the success of CLIP is its data and not the model architecture or
pre-training objective. However, CLIP only provides very limited information
about its data and how it has been collected, leading to works that aim to
reproduce CLIP's data by filtering with its model parameters. In this work, we
intend to reveal CLIP's data curation approach and, in our pursuit of making it
open to the community, introduce Metadata-Curated Language-Image Pre-training
(MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's
concepts) and yields a balanced subset over the metadata distribution. Our
experimental study rigorously isolates the model and training settings,
concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M
image-text data pairs outperforms CLIP's data on multiple standard benchmarks.
In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy,
surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B image-text data pairs, while maintaining
the same training budget, attains 72.4%. Our observations hold across various
model sizes, exemplified by ViT-H achieving 80.5%, without any
bells-and-whistles. The curation code and the training data distribution over metadata are
made available at https://github.com/facebookresearch/MetaCLIP.
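
The abstract describes the core curation step only at a high level: raw image-text pairs are matched against metadata entries (derived from CLIP's concepts), and the matches are then balanced so that frequent (head) entries do not dominate the resulting subset. The snippet below is a minimal, illustrative sketch of that idea, assuming the balancing is realized as a per-entry cap t on matched pairs; the function name curate, the pair/metadata formats, and the value t=20_000 are assumptions for illustration, not the released API (see the linked repository for the actual curation code).

```python
import random
from collections import defaultdict

def curate(pairs, metadata, t=20_000, seed=0):
    """Illustrative sketch of metadata-balanced curation.

    pairs:    list of (image_url, caption) candidates from the raw pool
    metadata: list of query strings (e.g., derived from CLIP's concepts)
    t:        per-entry cap; entries with more matches are sub-sampled
              so that head (frequent) entries do not dominate the subset
    """
    rng = random.Random(seed)

    # 1) Sub-string match each caption against every metadata entry.
    matches = defaultdict(list)  # metadata entry -> indices of matching pairs
    for i, (_, caption) in enumerate(pairs):
        text = caption.lower()
        for entry in metadata:
            if entry.lower() in text:
                matches[entry].append(i)

    # 2) Balance: keep at most t matched pairs per metadata entry,
    #    randomly sub-sampling the entries that exceed the cap.
    kept = set()
    for entry, idxs in matches.items():
        if len(idxs) > t:
            idxs = rng.sample(idxs, t)
        kept.update(idxs)

    # Pairs matching no entry are dropped; the rest form the curated subset.
    return [pairs[i] for i in sorted(kept)]


# Toy usage with hypothetical data:
pool = [("img1.jpg", "A photo of a golden retriever"),
        ("img2.jpg", "stock chart 2021"),
        ("img3.jpg", "A cute puppy playing fetch")]
subset = curate(pool, metadata=["photo", "puppy", "retriever"], t=2)
```

This cap-and-sample view only conveys the balancing idea; the released code operates at CommonCrawl scale and also publishes the resulting training data distribution over metadata.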