An Inverse Scaling Law for CLIP Training
May 11, 2023
Authors: Xianhang Li, Zeyu Wang, Cihang Xie
cs.AI
Abstract
CLIP, the first foundation model that connects images and text, has enabled many recent breakthroughs in computer vision. However, its associated training cost is prohibitively high, imposing a significant barrier to its widespread exploration. In this paper, we present a surprising finding that there exists an inverse scaling law for CLIP training, whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. Moreover, we showcase that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law.

As a result of this finding, we are able to successfully train CLIP even by using academic resources. For example, on an A100 eight-GPU server, our CLIP models achieve zero-shot top-1 ImageNet accuracies of 63.2% in ~2 days, 67.8% in ~3 days, and 69.3% in ~4 days. By reducing the computation barrier associated with CLIP, we hope to inspire more research in this field, particularly from academics. Our code is available at https://github.com/UCSC-VLAA/CLIPA.
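
The key lever in the finding above is the number of image/text tokens fed to the encoders during training. The snippet below is a minimal illustrative PyTorch sketch, not the authors' implementation: the patch size, encoder width, resolutions, and maximum text length are assumed values. It shows how training at a lower input resolution quadratically reduces the image token sequence of a ViT-style patch embedder, and how captions can simply be truncated to shorten the text sequence.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: with a ViT-style patch embedder of patch size P,
# an S x S input yields (S // P) ** 2 image tokens, so lowering the training
# resolution shrinks the image token sequence quadratically. This is the
# quantity that, per the paper's finding, larger encoders can reduce more
# aggressively during training.

def num_image_tokens(image_size: int, patch_size: int) -> int:
    return (image_size // patch_size) ** 2

# Hypothetical ViT-L/14-style patchifier; width 1024 and patch size 14 are
# assumptions for illustration.
patchify = nn.Conv2d(in_channels=3, out_channels=1024, kernel_size=14, stride=14)

for size in (224, 112, 84):
    x = torch.randn(1, 3, size, size)                # dummy image batch
    tokens = patchify(x).flatten(2).transpose(1, 2)  # shape (B, N, D)
    print(f"{size}px input -> {tokens.shape[1]} image tokens "
          f"(expected {num_image_tokens(size, 14)})")

# Text side: one simple way to shorten the sequence is truncating tokenized
# captions to a smaller maximum length than CLIP's default of 77 tokens.
max_text_len = 16                                    # assumed shorter length
caption_ids = torch.randint(0, 49408, (1, 77))       # stand-in for BPE token ids
caption_ids = caption_ids[:, :max_text_len]          # keep only the first tokens
print("text tokens:", caption_ids.shape[1])
```

Because self-attention cost grows with sequence length, halving the input resolution (e.g., 224px to 112px) cuts the image token count by roughly 4x, which is where the training-time savings on an eight-GPU server come from.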