EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

February 6, 2024
Authors: Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Xinlong Wang
cs.AI

Abstract

Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both vision and multimodal models. We present EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date, with 18-billion parameters. With only 6-billion training samples seen, EVA-CLIP-18B achieves an exceptional 80.7% zero-shot top-1 accuracy averaged across 27 widely recognized image classification benchmarks, outperforming its forerunner EVA-CLIP (5-billion parameters) and other open-source CLIP models by a large margin. Remarkably, we observe a consistent performance improvement with the model size scaling of EVA-CLIP, despite maintaining a constant training dataset of 2-billion image-text pairs from LAION-2B and COYO-700M. This dataset is openly available and much smaller than the in-house datasets (e.g., DFN-5B, WebLI-10B) employed in other state-of-the-art CLIP models. EVA-CLIP-18B demonstrates the potential of EVA-style weak-to-strong visual model scaling. With our model weights made publicly available, we hope to facilitate future research in vision and multimodal foundation models.
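For context, the headline metric above is zero-shot top-1 accuracy: class names are turned into text prompts, prompts and images are embedded into a shared space, and each image is assigned the class whose text embedding is most similar. The sketch below is a minimal illustration of that mechanic only; the encoders are random stand-ins (hypothetical placeholders, not the EVA-CLIP-18B API), and with the released weights one would substitute the model's actual image and text towers.

```python
# Minimal sketch of CLIP-style zero-shot classification (the metric reported
# in the abstract). The encoders below are random stand-ins; with the released
# EVA-CLIP-18B weights you would plug in its actual image and text towers.
import torch
import torch.nn.functional as F

EMBED_DIM = 64  # toy dimension; the real model uses a much larger joint embedding space

# Hypothetical placeholder encoders: random linear/embedding maps standing in
# for the vision and text towers of a trained CLIP model.
image_encoder = torch.nn.Linear(3 * 224 * 224, EMBED_DIM)
text_encoder = torch.nn.Embedding(1000, EMBED_DIM)  # one row per class "prompt" id


def zero_shot_classify(images: torch.Tensor, class_prompt_ids: torch.Tensor) -> torch.Tensor:
    """Assign each image the class whose prompt embedding is most similar."""
    img_emb = F.normalize(image_encoder(images.flatten(1)), dim=-1)  # (N, D)
    txt_emb = F.normalize(text_encoder(class_prompt_ids), dim=-1)    # (C, D)
    logits = img_emb @ txt_emb.t()   # cosine similarities, shape (N, C)
    return logits.argmax(dim=-1)     # predicted class index per image


# Toy usage: 8 random "images", 10 classes represented by prompt ids 0..9.
images = torch.randn(8, 3, 224, 224)
prompt_ids = torch.arange(10)
preds = zero_shot_classify(images, prompt_ids)
labels = torch.randint(0, 10, (8,))
top1 = (preds == labels).float().mean().item()  # zero-shot top-1 accuracy
print(f"top-1 accuracy on this toy batch: {top1:.2%}")
```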