StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

Abstract

We investigate the potential of learning visual representations using synthetic images generated by text-to-image models. This is a natural question in light of the excellent performance of such models in generating high-quality images. We consider specifically Stable Diffusion, one of the leading open-source text-to-image models. We show that (1) when the generative model is configured with a proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat the real-image counterpart; (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP using the same set of text prompts and corresponding real images, on large-scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images.
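
To make the multi-positive idea concrete, here is a minimal sketch of one way such a loss could be written. It is not taken from the authors' code. It assumes the loss is a cross-entropy between a target distribution spread uniformly over the other images generated from the same prompt and a softmax over cosine similarities. The function name, the `prompt_ids` argument, and the temperature value of 0.1 are all illustrative assumptions.

```python
import math


def multi_positive_contrastive_loss(embeddings, prompt_ids, temperature=0.1):
    """Illustrative multi-positive contrastive loss (a sketch, not the paper's code).

    embeddings: list of non-zero feature vectors, one per image.
    prompt_ids: list of labels; images sharing a label came from the same prompt.
    temperature: softmax temperature (0.1 is an assumed value).
    """

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    z = [normalize(v) for v in embeddings]
    n = len(z)
    losses = []
    for i in range(n):
        # Every other sample is a candidate; the anchor itself is excluded.
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if prompt_ids[j] == prompt_ids[i]]
        if not positives:
            continue  # an anchor without a same-prompt partner contributes nothing

        # Cosine similarity scaled by the temperature.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature for j in others
        }

        # Numerically stable log of the softmax normalizer.
        top = max(logits.values())
        log_norm = top + math.log(sum(math.exp(l - top) for l in logits.values()))

        # Cross-entropy against a uniform target over the positives.
        losses.append(-sum(logits[j] - log_norm for j in positives) / len(positives))

    if not losses:
        raise ValueError("no anchor has a same-prompt positive")
    return sum(losses) / len(losses)


# Toy usage: two prompts, two images each, with 2-D features.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss = multi_positive_contrastive_loss(features, prompt_ids=[0, 0, 1, 1])
```

In this toy batch the loss is small because images from the same prompt already lie close together. Shuffling the prompt labels so that distant features count as positives makes it much larger.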
