An Inverse Scaling Law for CLIP Training

Abstract

CLIP, the first foundation model that connects images and text, has enabled many recent breakthroughs in computer vision. However, its training cost is prohibitively high, which poses a significant barrier to its widespread exploration. In this paper, we present a surprising finding: there exists an inverse scaling law for CLIP training, whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. Moreover, we show that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law. As a result of this finding, we are able to train CLIP successfully even with academic resources. For example, on a server with eight A100 GPUs, our CLIP models achieve zero-shot top-1 ImageNet accuracies of 63.2% in ~2 days, 67.8% in ~3 days, and 69.3% in ~4 days. By reducing the computation barrier associated with CLIP, we hope to inspire more research in this field, particularly from academics. Our code is available at https://github.com/UCSC-VLAA/CLIPA.
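To make "token sequence length" concrete, the following is a minimal sketch of how the number of tokens fed to each encoder might be counted and reduced. It is not the authors' implementation: the abstract does not say which reduction strategies are used, so resizing, truncation, and random masking appear here only as assumed, illustrative options. The helper names (`num_image_tokens`, `truncate_text`, `random_keep`), the patch size of 16, and the resolutions are all hypothetical choices for illustration.

```python
import random


def num_image_tokens(image_size: int, patch_size: int) -> int:
    """Patch tokens a ViT-style encoder sees for a square image (class token excluded)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def truncate_text(token_ids, max_len):
    """Assumed strategy: keep only the first max_len text tokens."""
    return list(token_ids[:max_len])


def random_keep(tokens, keep_ratio, seed=0):
    """Assumed strategy: keep a random subset of tokens, preserving their order."""
    rng = random.Random(seed)
    n_keep = max(1, round(len(tokens) * keep_ratio))
    kept_indices = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_indices]


# Halving the resolution quarters the image token count (illustrative numbers).
full = num_image_tokens(224, 16)     # 14 * 14 patches
reduced = num_image_tokens(112, 16)  # 7 * 7 patches

# Shortening a 32-token caption to 8 tokens by truncation.
caption = list(range(32))
short_caption = truncate_text(caption, 8)

# Randomly keeping 25% of the image patch tokens.
patches = list(range(full))
kept_patches = random_keep(patches, 0.25)
```

Because self-attention cost grows with sequence length, shortening either sequence lowers the compute per training step. The abstract's claim is that larger encoders tolerate such shortening better, and that how the tokens are removed matters.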
