An Inverse Scaling Law for CLIP Training

Abstract

CLIP, the first foundation model that connects images and text, has enabled many recent breakthroughs in computer vision. However, its training cost is prohibitively high, which poses a significant barrier to widespread exploration. In this paper, we present a surprising finding: there exists an inverse scaling law for CLIP training, whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. Moreover, we show that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law. As a result of this finding, we are able to train CLIP successfully even with academic resources. For example, on an A100 eight-GPU server, our CLIP models achieve zero-shot top-1 ImageNet accuracies of 63.2% in ~2 days, 67.8% in ~3 days, and 69.3% in ~4 days. By lowering the computational barrier associated with CLIP, we hope to inspire more research in this field, particularly from academia. Our code is available at https://github.com/UCSC-VLAA/CLIPA.
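To make the idea of "reducing image/text token length" concrete, here is a minimal sketch of what such a reduction could look like. Everything in it is an illustrative assumption rather than the authors' method: the abstract does not name specific strategies, so truncation and random masking are stand-ins, and the helper names (`truncate_tokens`, `random_mask_tokens`, `num_image_tokens`), the patch size of 16, and the image sizes are hypothetical.

```python
import random


def truncate_tokens(tokens, max_len):
    """Keep only the first max_len tokens (assumed strategy, e.g. for text)."""
    return tokens[:max_len]


def random_mask_tokens(tokens, keep_ratio, seed=None):
    """Keep a random subset of tokens, preserving their original order.

    Assumed strategy for illustration; not taken from the abstract.
    """
    rng = random.Random(seed)
    # Keep at least one token when the input is non-empty.
    n_keep = min(len(tokens), max(1, int(len(tokens) * keep_ratio)))
    kept_indices = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_indices]


def num_image_tokens(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2


if __name__ == "__main__":
    # Hypothetical ViT-style patching with 16x16 patches.
    full = num_image_tokens(224, 16)
    half_res = num_image_tokens(112, 16)
    print(f"image tokens at 224px: {full}, at 112px: {half_res}")

    patches = list(range(full))
    kept = random_mask_tokens(patches, keep_ratio=0.25, seed=0)
    print(f"patches kept after random masking: {len(kept)}")

    text = list(range(32))
    print(f"text tokens after truncation: {len(truncate_tokens(text, 8))}")
```

Because self-attention cost grows roughly quadratically with sequence length, shortening the token sequence by a factor of four, as in this sketch, cuts the attention cost per sample by considerably more than a factor of four.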
