StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

Abstract

We investigate the potential of learning visual representations using synthetic images generated by text-to-image models. This is a natural question in light of the excellent performance of such models in generating high-quality images. We specifically consider Stable Diffusion, one of the leading open-source text-to-image models. We show that (1) when the generative model is configured with a proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat their real-image counterparts; and (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass those learned by SimCLR and CLIP using the same set of text prompts and the corresponding real images, on large-scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images.
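The abstract does not spell out the loss behind the multi-positive contrastive method. As a minimal sketch, assume each anchor's target is a uniform distribution over the other images generated from the same prompt, and that it is matched against a softmax over cosine similarities. The function name, the `prompt_ids` argument, and the temperature value are hypothetical choices for illustration, not taken from the source.

```python
import numpy as np

def multi_positive_contrastive_loss(embeddings, prompt_ids, temperature=0.1):
    """Cross-entropy between a uniform target over same-prompt images
    and a softmax over pairwise cosine similarities (illustrative sketch)."""
    z = np.asarray(embeddings, dtype=np.float64)
    ids = np.asarray(prompt_ids)
    n = len(ids)

    # L2-normalize so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    # An image is never its own positive or negative: mask the diagonal.
    self_mask = np.eye(n, dtype=bool)
    logits = np.where(self_mask, -np.inf, logits)

    # Numerically stable log-softmax over each row.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_q = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_q = np.where(self_mask, 0.0, log_q)  # diagonal carries zero target mass

    # Positives: other images that share the anchor's prompt id.
    positives = (ids[:, None] == ids[None, :]) & ~self_mask
    counts = positives.sum(axis=1)
    if np.any(counts == 0):
        raise ValueError("every prompt needs at least two images in the batch")
    target = positives / counts[:, None]

    # Average cross-entropy H(target, q) over all anchors.
    return float(-(target * log_q).sum(axis=1).mean())
```

For instance, a batch holding two images for each of two prompts would pass `prompt_ids=[0, 0, 1, 1]`. The loss is small when embeddings of same-prompt images are close together and large when they are closer to images from other prompts.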

