Learning Vision from Models Rivals Learning Vision from Data
December 28, 2023
Authors: Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, Phillip Isola
cs.AI
Abstract
We introduce SynCLR, a novel approach for learning visual representations
exclusively from synthetic images and synthetic captions, without any real
data. We synthesize a large dataset of image captions using LLMs, then use an
off-the-shelf text-to-image model to generate multiple images corresponding to
each synthetic caption. We perform visual representation learning on these
synthetic images via contrastive learning, treating images sharing the same
caption as positive pairs. The resulting representations transfer well to many
downstream tasks, competing favorably with other general-purpose visual
representation learners such as CLIP and DINO v2 in image classification tasks.
Furthermore, in dense prediction tasks such as semantic segmentation, SynCLR
outperforms previous self-supervised methods by a significant margin, e.g.,
improving over MAE and iBOT by 6.2 and 4.3 mIoU, respectively, on ADE20k with ViT-B/16.
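The core objective the abstract describes, contrastive learning in which all images generated from the same synthetic caption are positives for one another, can be sketched as a multi-positive InfoNCE loss. The following is a minimal illustration, not the authors' released code; the function name, the temperature value, and the batch layout are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(features: torch.Tensor,
                                    caption_ids: torch.Tensor,
                                    temperature: float = 0.1) -> torch.Tensor:
    """features: (N, D) image embeddings; caption_ids: (N,) integer id of the
    synthetic caption each image was generated from. Images sharing a caption
    id are treated as positives; all other images in the batch are negatives."""
    features = F.normalize(features, dim=1)
    logits = features @ features.t() / temperature        # (N, N) scaled cosine similarities
    n = features.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=features.device)
    logits = logits.masked_fill(self_mask, float('-inf'))  # an image is never its own positive
    # Positive pairs: same caption id, excluding the anchor itself.
    pos_mask = (caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)) & ~self_mask
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average the log-likelihood over each anchor's positives; anchors whose
    # caption appears only once in the batch contribute zero loss.
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts
    return loss.mean()

# Toy usage: 6 embeddings, 3 captions, 2 generated images per caption.
feats = torch.randn(6, 128)
ids = torch.tensor([0, 0, 1, 1, 2, 2])
print(multi_positive_contrastive_loss(feats, ids))
```

With several images generated per caption, each anchor sees multiple in-batch positives, which is the distinction from single-positive SimCLR-style pairing that the abstract's "treating images sharing the same caption as positive pairs" refers to.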