
MetaCLIP 2: A Worldwide Scaling Recipe

July 29, 2025
作者: Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, Hu Xu
cs.AI

Abstract

Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting tasks from zero-shot classification and retrieval to serving as the vision encoder for multimodal large language models (MLLMs). Although CLIP has been successfully trained on billion-scale image-text pairs from the English world, scaling its training further to worldwide web data remains challenging: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP models is worse than that of their English-only counterparts, i.e., the "curse of multilinguality" that is also common in LLMs. Here, we present MetaCLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with the minimal changes necessary to address the above challenges and present a recipe that enables mutual benefits between English and non-English world data. In zero-shot ImageNet classification, MetaCLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and, surprisingly, sets a new state of the art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, reaching 57.4% on CVQA, 50.2% on Babel-ImageNet, and 64.3% on XM3600 image-to-text retrieval.
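For readers unfamiliar with how a CLIP encoder enables the zero-shot classification evaluated above: class names are turned into text prompts, both modalities are embedded, and the image is assigned to the most similar prompt. The sketch below illustrates this with a generic public CLIP checkpoint via the Hugging Face transformers API; the checkpoint name, image path, and class list are placeholders and not MetaCLIP 2 artifacts.

```python
# Minimal sketch of CLIP-style zero-shot classification, assuming the
# Hugging Face `transformers` CLIP API. The checkpoint below is a generic
# public CLIP model used as a stand-in; MetaCLIP 2 weights are not claimed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # stand-in checkpoint (assumption)
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.open("example.jpg")  # hypothetical input image
class_names = ["dog", "cat", "bird"]
prompts = [f"a photo of a {name}" for name in class_names]

# Embed the image and all text prompts jointly.
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores; softmax over the
# prompts yields per-class probabilities for the zero-shot prediction.
probs = outputs.logits_per_image.softmax(dim=-1)
predicted = class_names[probs.argmax(dim=-1).item()]
print(predicted, probs.tolist())
```

Benchmarks such as ImageNet or Babel-ImageNet apply this procedure with their own class vocabularies (and, for multilingual settings, prompts in each target language), with no classifier training involved.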