Learning Vision from Models Rivals Learning Vision from Data

Abstract

We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions, without any real data. We first synthesize a large dataset of image captions using large language models (LLMs), then use an off-the-shelf text-to-image model to generate multiple images for each synthetic caption. We then learn visual representations from these synthetic images via contrastive learning, treating images that share the same caption as positive pairs. The resulting representations transfer well to many downstream tasks, competing favorably with other general-purpose visual representation learners such as CLIP and DINO v2 on image classification. Furthermore, on dense prediction tasks such as semantic segmentation, SynCLR outperforms previous self-supervised methods by a significant margin, e.g., improving over MAE and iBOT by 6.2 and 4.3 mIoU, respectively, on ADE20k with ViT-B/16.
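The abstract states only that images generated from the same caption are treated as positive pairs; it does not spell out the loss. As a rough illustration, the sketch below assumes a multi-positive softmax contrastive objective, in which each image's similarity distribution over the other images in the batch is matched to a uniform target over its same-caption siblings. The function name `multi_positive_contrastive_loss`, the `temperature` value, and the use of plain NumPy arrays in place of encoder outputs are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Hypothetical sketch: contrastive loss where images sharing a caption
    are positives. `temperature` is an assumed hyperparameter."""
    z = np.asarray(embeddings, dtype=np.float64)
    ids = np.asarray(caption_ids)

    # L2-normalize so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    # An image is never compared against itself.
    self_mask = np.eye(len(ids), dtype=bool)
    logits = np.where(self_mask, -np.inf, logits)

    # Numerically stable log-softmax over each row.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Positives: other images generated from the same caption.
    positives = (ids[:, None] == ids[None, :]) & ~self_mask
    n_pos = positives.sum(axis=1)
    if np.any(n_pos == 0):
        raise ValueError("every caption needs at least two images in the batch")

    # Cross-entropy against a uniform target over each anchor's positives.
    per_anchor = -np.where(positives, log_prob, 0.0).sum(axis=1) / n_pos
    return float(per_anchor.mean())


# Toy batch: two captions, two generated images each (stand-in embeddings).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_grouped = multi_positive_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
loss_mismatched = multi_positive_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

In this toy batch, the embeddings of same-caption images are already close, so the loss under the correct caption grouping is lower than under a mismatched grouping. In a real training loop the embeddings would come from the vision encoder applied to the generated images.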
