Demystifying CLIP Data
September 28, 2023
Authors: Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
cs.AI
Abstract
Contrastive Language-Image Pre-training (CLIP) is an approach that has
advanced research and applications in computer vision, fueling modern
recognition systems and generative models. We believe that the main ingredient
in CLIP's success is its data and not the model architecture or
pre-training objective. However, CLIP only provides very limited information
about its data and how it has been collected, leading to works that aim to
reproduce CLIP's data by filtering with its model parameters. In this work, we
intend to reveal CLIP's data curation approach and, in our pursuit of making it
open to the community, introduce Metadata-Curated Language-Image Pre-training
(MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's
concepts) and yields a balanced subset over the metadata distribution. Our
experimental study rigorously isolates the model and training settings,
concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M
image-text data pairs outperforms CLIP's data on multiple standard benchmarks.
In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy,
surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data pairs, while maintaining
the same training budget, attains 72.4%. Our observations hold across various
model sizes, exemplified by ViT-H achieving 80.5%, without any
bells-and-whistles. The curation code and the training data distribution over
the metadata are available at https://github.com/facebookresearch/MetaCLIP.
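
To make the curation step described in the abstract concrete, below is a minimal sketch of what balancing a raw pool over a metadata distribution could look like. It is an illustration under stated assumptions, not the pipeline from the MetaCLIP repository: the helper curate, the in-memory pool of (image, text) pairs, the naive substring matching, and the per-entry cap t are hypothetical choices made for brevity.

import random
from collections import defaultdict

def curate(pool, metadata, t=20_000, seed=0):
    """Sketch of metadata-balanced curation (illustrative, not the official code).

    pool     -- small in-memory list of (image, text) pairs (assumption)
    metadata -- list of metadata entries ("concepts") to match against
    t        -- per-entry cap controlling how aggressively head entries are
                down-sampled (illustrative value, an assumption)
    """
    rng = random.Random(seed)
    pool = list(pool)
    metadata = list(metadata)

    # Step 1: substring-match each caption against the metadata entries and
    # count how many captions each entry matches.
    counts = defaultdict(int)
    matches = []  # for each pair, the indices of the metadata entries it matched
    for _, text in pool:
        hit = [i for i, entry in enumerate(metadata) if entry in text]
        for i in hit:
            counts[i] += 1
        matches.append(hit)

    # Step 2: balance the distribution over metadata -- "head" entries with
    # more than t matching captions are down-sampled with probability
    # t / count, while "tail" entries are always kept.
    keep_prob = {i: min(1.0, t / c) for i, c in counts.items()}

    # Step 3: keep an image-text pair if at least one of its matched entries
    # survives the per-entry sub-sampling; pairs with no match are dropped.
    curated = [
        pair
        for pair, hit in zip(pool, matches)
        if any(rng.random() < keep_prob[i] for i in hit)
    ]
    return curated

The point the sketch tries to convey is the design choice stated in the abstract: frequent ("head") metadata entries are capped while rare ("tail") entries are kept in full, so the resulting subset is balanced over the metadata distribution rather than filtered with a pre-trained model.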