StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

Abstract

We investigate the potential of learning visual representations from synthetic images generated by text-to-image models. This is a natural question in light of the excellent performance of such models in generating high-quality images. We consider specifically Stable Diffusion, one of the leading open-source text-to-image models. We show that (1) when the generative model is configured with a proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat the real-image counterpart; (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP using the same set of text prompts and corresponding real images, on large-scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images.
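
The abstract does not spell out the loss, so the following is only a minimal sketch of what a multi-positive contrastive objective could look like. It assumes that, for each anchor image, the target distribution is uniform over the other images generated from the same prompt, and that it is compared by cross-entropy against a softmax over cosine similarities. The function name `multi_positive_contrastive_loss`, the `caption_ids` argument, and the `temperature` value are illustrative choices, not taken from the source.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Sketch of a multi-positive contrastive loss (assumed form, see lead-in).

    embeddings  : list of feature vectors, one per image
    caption_ids : list of integers; images sharing an id came from the same prompt
    temperature : softmax temperature (hypothetical default)
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0

    for i in range(n):
        # Candidates are all other images; positives share the anchor's prompt.
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if caption_ids[j] == caption_ids[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing

        # Temperature-scaled cosine similarities between the anchor and candidates.
        logits = [sum(a * b for a, b in zip(z[i], z[j])) / temperature for j in others]

        # Numerically stable log-softmax over the candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        log_q = {j: l - log_denominator for j, l in zip(others, logits)}

        # Cross-entropy against a uniform target over the positives.
        total += -sum(log_q[j] for j in positives) / len(positives)
        counted += 1

    if counted == 0:
        raise ValueError("no anchor has a positive; need at least two images per prompt")
    return total / counted


if __name__ == "__main__":
    # Toy 2-D features: images 0 and 1 are close, images 2 and 3 are close.
    features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(multi_positive_contrastive_loss(features, [0, 0, 1, 1]))
    print(multi_positive_contrastive_loss(features, [0, 1, 0, 1]))
```

In the toy example, grouping the nearby features under the same prompt id gives a lower loss than grouping distant features together, which is the behavior such an objective is meant to reward. A practical implementation would run on batched tensors in a deep learning framework rather than Python lists.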
