
CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy

June 27, 2023
Authors: Xianhang Li, Zeyu Wang, Cihang Xie
cs.AI

Abstract

The recent work CLIPA presents an inverse scaling law for CLIP training: the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. This finding enables us to train high-performance CLIP models with significantly reduced computation. Building upon this work, we present CLIPA-v2 with two key contributions. Technically, we find this inverse scaling law is also applicable in the finetuning stage, enabling a further reduction in computational needs. Empirically, we explore CLIPA at scale, extending the experiments up to the H/14 model with ~13B image-text pairs seen during training. Our results are exciting: with a budget of only $10,000, our CLIP model achieves an impressive zero-shot ImageNet accuracy of 81.1%, surpassing the prior best CLIP model (from OpenCLIP, 80.1%) by 1.0% while reducing the computational cost by ~39x. Moreover, with an additional investment of $4,000, we can further elevate the zero-shot ImageNet accuracy to 81.8%. Our code and models are available at https://github.com/UCSC-VLAA/CLIPA.
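The compute savings described above come from training the encoders on shortened token sequences. Below is a minimal, hypothetical PyTorch sketch of one such image-token reduction strategy (random patch masking, one of the reduction schemes explored in the CLIPA line of work); the function name, shapes, and keep ratio are illustrative assumptions, not the authors' exact recipe.

```python
# Minimal sketch of reduced-token CLIP pretraining, illustrating the inverse
# scaling law described in the abstract. The masking ratio, tensor shapes, and
# helper name are illustrative assumptions, not the paper's exact implementation.
import torch

def subsample_image_tokens(patch_tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Randomly keep a fraction of patch tokens per image.

    patch_tokens: [batch, num_patches, dim] embeddings from the ViT
    patch-embedding layer (before the transformer blocks).
    """
    b, n, _ = patch_tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    # Independent random permutation per image; keep the first n_keep indices.
    idx = torch.argsort(torch.rand(b, n, device=patch_tokens.device), dim=1)[:, :n_keep]
    return torch.gather(
        patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
    )

# Example: an H/14 encoder at 224x224 resolution sees 256 patch tokens; keeping
# 25% of them shrinks the quadratic self-attention cost by ~16x and the MLP
# cost by ~4x for that stage of training (illustrative numbers, not from the paper).
tokens = torch.randn(8, 256, 1280)           # [batch, patches, width]
short_tokens = subsample_image_tokens(tokens, keep_ratio=0.25)
print(short_tokens.shape)                    # torch.Size([8, 64, 1280])
```

Per the abstract, CLIPA-v2's technical contribution is that this reduced-length training also works in the finetuning stage, which is where the further compute reduction beyond the original CLIPA setup comes from.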
