

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

February 6, 2024
Authors: Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Xinlong Wang
cs.AI

Abstract

Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both vision and multimodal models. We present EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date, with 18-billion parameters. With only 6-billion training samples seen, EVA-CLIP-18B achieves an exceptional 80.7% zero-shot top-1 accuracy averaged across 27 widely recognized image classification benchmarks, outperforming its forerunner EVA-CLIP (5-billion parameters) and other open-source CLIP models by a large margin. Remarkably, we observe a consistent performance improvement with the model size scaling of EVA-CLIP, despite maintaining a constant training dataset of 2-billion image-text pairs from LAION-2B and COYO-700M. This dataset is openly available and much smaller than the in-house datasets (e.g., DFN-5B, WebLI-10B) employed in other state-of-the-art CLIP models. EVA-CLIP-18B demonstrates the potential of EVA-style weak-to-strong visual model scaling. With our model weights made publicly available, we hope to facilitate future research in vision and multimodal foundation models.
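For context, the 80.7% figure comes from the standard CLIP zero-shot classification protocol: class names are turned into text prompts, both prompts and images are embedded, and the highest image-text cosine similarity gives the top-1 prediction. The sketch below illustrates this protocol with the open_clip library; the model/pretrained tags and file names are placeholders (assumptions, not the authors' release pipeline), and the actual EVA-CLIP-18B weights are distributed by the authors and may require their own loading code.

    # Minimal sketch of CLIP zero-shot top-1 classification, the protocol behind
    # the averaged accuracy reported in the abstract. The checkpoint tag below is
    # a placeholder smaller EVA-CLIP model, not the 18B release itself.
    import torch
    import open_clip
    from PIL import Image

    model_name, pretrained = "EVA02-L-14", "merged2b_s4b_b131k"  # placeholder checkpoint
    model, _, preprocess = open_clip.create_model_and_transforms(model_name, pretrained=pretrained)
    tokenizer = open_clip.get_tokenizer(model_name)
    model.eval()

    class_names = ["golden retriever", "tabby cat", "sports car"]  # toy label set
    prompts = [f"a photo of a {c}" for c in class_names]

    image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical test image
    text = tokenizer(prompts)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Normalize and compare: cosine similarity between the image and each class prompt
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        logits = 100.0 * image_features @ text_features.T

    pred = logits.argmax(dim=-1).item()  # zero-shot top-1 prediction
    print(class_names[pred])

The benchmark number in the paper is this same procedure averaged over 27 classification datasets, typically with a set of prompt templates per class rather than the single template used here.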