

StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

June 1, 2023
Authors: Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, Dilip Krishnan
cs.AI

Abstract

We investigate the potential of learning visual representations from synthetic images generated by text-to-image models. This is a natural question in light of the excellent performance of such models at generating high-quality images. We specifically consider Stable Diffusion, one of the leading open-source text-to-image models. We show that (1) when the generative model is configured with a proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat training on the corresponding real images; and (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass those learned by SimCLR and CLIP on large-scale datasets, using the same set of text prompts and the corresponding real images. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images.
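The multi-positive idea in point (2) can be sketched as a contrastive loss in which every pair of samples generated from the same text prompt is a positive pair, and the target for each anchor is a distribution spread uniformly over its positives. The sketch below is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name, the uniform target distribution, and the temperature value are assumptions for illustration.

```python
import numpy as np

def multi_positive_contrastive_loss(embeddings, prompt_ids, temperature=0.1):
    """Illustrative multi-positive contrastive loss (hypothetical sketch).

    embeddings: (n, d) array of image embeddings.
    prompt_ids: length-n sequence; samples sharing an id were generated
                from the same text prompt and are treated as positives.
    """
    # L2-normalize so the dot product is cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    n = len(prompt_ids)
    mask = ~np.eye(n, dtype=bool)  # exclude self-similarity
    ids = np.asarray(prompt_ids)
    pos = (ids[:, None] == ids[None, :]) & mask

    # Target distribution: uniform over each anchor's positives.
    target = pos / pos.sum(axis=1, keepdims=True)

    # Log-softmax over all non-self pairs (diagonal masked to -inf).
    logits = np.where(mask, logits, -np.inf)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy between the target and the softmax of similarities.
    loss = -(target * np.where(pos, log_prob, 0.0)).sum(axis=1).mean()
    return float(loss)
```

With groups of aligned embeddings (all images from one prompt mapped to the same point) the loss approaches zero, while randomly scattered embeddings incur a loss near log of the number of candidates, which is the intended behavior of a contrastive objective with multiple positives.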