MetaCLIP 2: A Worldwide Scaling Recipe

Abstract

Contrastive Language-Image Pretraining (CLIP) is a popular foundation model whose uses range from zero-shot classification and retrieval to serving as the encoder for multimodal large language models (MLLMs). Although CLIP has been trained successfully on billion-scale image-text pairs from the English world, scaling its training further to worldwide web data remains challenging for two reasons: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP models is worse than that of their English-only counterparts, i.e., the "curse of multilinguality" that is common in LLMs. Here, we present MetaCLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with only the minimal changes necessary to address the above challenges, and present a recipe that enables mutual benefits between English and non-English world data. In zero-shot ImageNet classification, MetaCLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%. Surprisingly, it also sets a new state of the art on multilingual benchmarks without system-level confounding factors (e.g., translation, bespoke architecture changes): 57.4% on CVQA, 50.2% on Babel-ImageNet, and 64.3% on XM3600 image-to-text retrieval.
