Learning Vision from Models Rivals Learning Vision from Data

December 28, 2023
Authors: Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, Phillip Isola
cs.AI

Abstract

We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions, without any real data. We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption. We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs. The resulting representations transfer well to many downstream tasks, competing favorably with other general-purpose visual representation learners such as CLIP and DINO v2 in image classification tasks. Furthermore, in dense prediction tasks such as semantic segmentation, SynCLR outperforms previous self-supervised methods by a significant margin, e.g., improving over MAE and iBOT by 6.2 and 4.3 mIoU on ADE20k for ViT-B/16.
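To make the contrastive objective described above concrete, here is a minimal PyTorch sketch, not the authors' released code: embeddings of images generated from the same synthetic caption are treated as positive pairs, and everything else in the batch as negatives. The function name, tensor shapes, and the temperature default are illustrative assumptions; only the same-caption-as-positives idea comes from the abstract.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(embeddings: torch.Tensor,
                                    caption_ids: torch.Tensor,
                                    temperature: float = 0.1) -> torch.Tensor:
    """Contrast every image against the batch; images generated from the
    same caption (equal caption_ids) are positives, the rest negatives.

    embeddings:  (N, D) feature vectors from the image encoder
    caption_ids: (N,) integer id of the synthetic caption behind each image
    """
    z = F.normalize(embeddings, dim=1)        # cosine-similarity space
    logits = z @ z.t() / temperature          # (N, N) pairwise scores

    # Exclude self-similarity so an image is never its own positive.
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, float("-inf"))

    # Positives: same caption id, excluding the anchor itself.
    pos_mask = (caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)) & ~self_mask

    # Row-wise log-softmax, averaged over each anchor's positives.
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts
    return loss.mean()

# Toy usage: 6 images, 2 generated per caption, 4-dim embeddings.
z = torch.randn(6, 4)
ids = torch.tensor([0, 0, 1, 1, 2, 2])
print(multi_positive_contrastive_loss(z, ids))
```

This is essentially a supervised-contrastive formulation in which the synthetic caption id plays the role of a class label, which is what lets the method use multiple generated images per caption as mutual positives.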