

CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy

June 27, 2023
Authors: Xianhang Li, Zeyu Wang, Cihang Xie
cs.AI

Abstract

The recent work CLIPA presents an inverse scaling law for CLIP training -- whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. This finding enables us to train high-performance CLIP models with significantly reduced computations. Building upon this work, we hereby present CLIPA-v2 with two key contributions. Technically, we find this inverse scaling law is also applicable in the finetuning stage, enabling further reduction in computational needs. Empirically, we explore CLIPA at scale, extending the experiments up to the H/14 model with ~13B image-text pairs seen during training. Our results are exciting -- by only allocating a budget of $10,000, our CLIP model achieves an impressive zero-shot ImageNet accuracy of 81.1%, surpassing the prior best CLIP model (from OpenCLIP, 80.1%) by 1.0% and meanwhile reducing the computational cost by ~39X. Moreover, with an additional investment of $4,000, we can further elevate the zero-shot ImageNet accuracy to 81.8%. Our code and models are available at https://github.com/UCSC-VLAA/CLIPA.
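To make the inverse scaling law concrete, below is a minimal sketch (not the authors' code) of why shortening the image token sequence cuts training compute for a ViT-based image encoder such as H/14. The 84px pretraining resolution and the `num_image_tokens` helper are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch: image token sequence length for a ViT-style encoder
# shrinks quadratically with the training resolution, which is the lever
# the inverse scaling law exploits. Resolutions below are illustrative.

def num_image_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT sees for a square image."""
    return (image_size // patch_size) ** 2

# H/14 patch size at the standard 224px resolution vs. a hypothetical
# reduced 84px pretraining resolution.
full = num_image_tokens(224, 14)     # 256 tokens
reduced = num_image_tokens(84, 14)   # 36 tokens

# Self-attention cost grows roughly quadratically with sequence length,
# so the per-step savings from the shorter sequence are substantial.
print(f"full: {full} tokens, reduced: {reduced} tokens")
print(f"approx. attention-cost ratio: {(full / reduced) ** 2:.1f}x")
```

CLIPA-v2's contribution is that this kind of reduced-token training remains effective even in the finetuning stage, so only a small fraction of the schedule needs to run at the full sequence length.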