StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

Abstract

We investigate the potential of learning visual representations from synthetic images generated by text-to-image models. This is a natural question in light of the excellent performance of such models in generating high-quality images. We focus specifically on Stable Diffusion, one of the leading open-source text-to-image models. We show that (1) when the generative model is configured with a proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat the real-image counterpart; and (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass those learned by SimCLR and CLIP using the same set of text prompts and corresponding real images, on large-scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images.
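
The abstract names a multi-positive contrastive objective but gives no formula. The following is a minimal sketch of one plausible formulation, not the authors' implementation: for each anchor image, a softmax over cosine similarities to all other images in the batch is compared, via cross-entropy, against a target that is uniform over the images generated from the same prompt. The function name, the `prompt_ids` argument, and the default temperature of 0.1 are assumptions made for illustration, and pure Python is used only to keep the example self-contained.

```python
import math


def multi_positive_contrastive_loss(embeddings, prompt_ids, temperature=0.1):
    """Mean cross-entropy between a uniform target over same-prompt images
    and a softmax over cosine similarities (the anchor itself is excluded).

    embeddings: list of equal-length, non-zero float vectors, one per image.
    prompt_ids: list of hashable ids; images sharing an id are positives.
    temperature: softmax temperature (hypothetical default, not from the source).
    """
    # L2-normalise so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, counted = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if prompt_ids[j] == prompt_ids[i]]
        if not positives:
            continue  # an anchor with no same-prompt partner is skipped

        logits = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in others
        }
        # Log-sum-exp gives a numerically stable log-softmax denominator.
        peak = max(logits.values())
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))

        # The target distribution is uniform over the positives.
        total -= sum(logits[j] - log_z for j in positives) / len(positives)
        counted += 1

    if counted == 0:
        raise ValueError("no anchor has a positive; need >= 2 images per prompt")
    return total / counted


# Toy batch: two prompts, two generated images each (made-up 2-D embeddings).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_loss = multi_positive_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
shuffled_loss = multi_positive_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

In this toy batch the loss is lower when the prompt ids match the embedding clusters than when they are shuffled, which is the behaviour such an objective is meant to reward. A practical version would operate on batched tensors from an image encoder rather than Python lists.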
