StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners
June 1, 2023
Authors: Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, Dilip Krishnan
cs.AI
Abstract
We investigate the potential of learning visual representations using
synthetic images generated by text-to-image models. This is a natural question
in the light of the excellent performance of such models in generating
high-quality images. We specifically consider Stable Diffusion, one of the
leading open-source text-to-image models. We show that (1) when the generative
model is configured with a proper classifier-free guidance scale, training
self-supervised methods on synthetic images can match or beat the real image
counterpart; (2) by treating the multiple images generated from the same text
prompt as positives for each other, we develop a multi-positive contrastive
learning method, which we call StableRep. With solely synthetic images, the
representations learned by StableRep surpass the performance of representations
learned by SimCLR and CLIP using the same set of text prompts and corresponding
real images, on large-scale datasets. When we further add language supervision,
StableRep trained with 20M synthetic images achieves better accuracy than CLIP
trained with 50M real images.
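
The abstract does not spell out the loss, so the following is only a minimal sketch of one plausible multi-positive contrastive objective, not the authors' implementation. It assumes each image embedding is compared against every other embedding in the batch by cosine similarity, and that the target distribution is uniform over the images generated from the same prompt. The function name `multi_positive_contrastive_loss`, the `prompt_ids` argument, and the default temperature of 0.1 are all hypothetical choices made for illustration.

```python
import math


def multi_positive_contrastive_loss(embeddings, prompt_ids, temperature=0.1):
    """Cross-entropy between a uniform target over same-prompt images and a
    softmax over cosine similarities, averaged over anchors.

    Assumed formulation for illustration; the temperature is a placeholder.
    """
    # L2-normalize so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    total, counted = 0.0, 0
    n = len(normed)
    for i in range(n):
        others = [j for j in range(n) if j != i]  # exclude the anchor itself
        positives = [j for j in others if prompt_ids[j] == prompt_ids[i]]
        if not positives:
            continue  # an anchor with no same-prompt partner contributes nothing

        logits = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in others
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits.values())
        log_denominator = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        # Uniform target: average the negative log-probability of each positive.
        total -= sum(logits[j] - log_denominator for j in positives) / len(positives)
        counted += 1

    if counted == 0:
        raise ValueError("no anchor has a positive; need >= 2 images per prompt")
    return total / counted


# Toy batch: two hypothetical prompts, two "generated images" each.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matched = multi_positive_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
loss_mismatched = multi_positive_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

In this toy batch, the loss should be lower when the prompt labels agree with the embedding geometry (`loss_matched`) than when they are shuffled (`loss_mismatched`), which is the behavior a multi-positive objective is meant to reward.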