An Inverse Scaling Law for CLIP Training
May 11, 2023
Authors: Xianhang Li, Zeyu Wang, Cihang Xie
cs.AI
Abstract
CLIP, the first foundation model that connects images and text, has enabled many recent breakthroughs in computer vision. However, its associated training cost is prohibitively high, imposing a significant barrier to its widespread exploration. In this paper, we present a surprising finding that there exists an inverse scaling law for CLIP training, whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. Moreover, we showcase that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law.

As a result of this finding, we are able to successfully train CLIP even by using academic resources. For example, on an A100 eight-GPU server, our CLIP models achieve zero-shot top-1 ImageNet accuracies of 63.2% in ~2 days, 67.8% in ~3 days, and 69.3% in ~4 days. By reducing the computation barrier associated with CLIP, we hope to inspire more research in this field, particularly from academics. Our code is available at https://github.com/UCSC-VLAA/CLIPA.
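
To make the abstract's notion of "reducing image/text token length" concrete, here is a rough, hypothetical sketch (not the released CLIPA implementation) of two simple reduction strategies, assuming a ViT-style patch embedding and CLIP-style 77-token captions; the function names, keep_ratio, and max_len values are illustrative only.

    # Hypothetical sketch of two token-length-reduction strategies mentioned
    # in the abstract: randomly dropping image patch tokens and truncating
    # text tokens. Illustration only; not the CLIPA code.
    import torch

    def reduce_image_tokens(patch_tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
        # patch_tokens: [batch, num_patches, dim] output of a ViT patch embedding.
        # keep_ratio:   fraction of patches to keep (e.g. 0.25 keeps one in four).
        b, n, d = patch_tokens.shape
        n_keep = max(1, int(n * keep_ratio))
        # Draw an independent random permutation per sample, keep the first n_keep indices.
        idx = torch.rand(b, n, device=patch_tokens.device).argsort(dim=1)[:, :n_keep]
        return patch_tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

    def reduce_text_tokens(token_ids: torch.Tensor, max_len: int) -> torch.Tensor:
        # Truncate tokenized captions to their first max_len tokens.
        return token_ids[:, :max_len]

    # Usage with dummy data:
    imgs = torch.randn(8, 196, 768)          # e.g. 14x14 patches from a 224px image
    short_imgs = reduce_image_tokens(imgs, keep_ratio=0.25)   # 196 -> 49 image tokens
    caps = torch.randint(0, 49408, (8, 77))                   # CLIP-style 77-token captions
    short_caps = reduce_text_tokens(caps, max_len=16)         # 77 -> 16 text tokens

Shorter token sequences make each encoder forward/backward pass cheaper, which is what allows a larger image/text encoder to fit the same academic compute budget; the paper's inverse scaling law characterizes how far this shortening can go as encoder size grows.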