Learning Vision from Models Rivals Learning Vision from Data

Abstract

We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions, without any real data. We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption. We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs. The resulting representations transfer well to many downstream tasks, competing favorably with other general-purpose visual representation learners such as CLIP and DINO v2 in image classification tasks. Furthermore, in dense prediction tasks such as semantic segmentation, SynCLR outperforms previous self-supervised methods by a significant margin, e.g., improving over MAE and iBOT by 6.2 and 4.3 mIoU on ADE20k for ViT-B/16.
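To make the idea of treating images that share a caption as positive pairs concrete, the following is a minimal sketch of one possible multi-positive contrastive loss. It is an illustration under stated assumptions, not necessarily the exact objective used for SynCLR: the function names, the temperature value, and the toy embeddings are all hypothetical, and the loss shown is a generic cross-entropy between a softmax over cosine similarities and a uniform target over same-caption images.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Illustrative loss (assumed form, not taken from the abstract).

    For each anchor image, the target distribution is uniform over the other
    images generated from the same caption; the prediction is a softmax over
    the anchor's similarities to every other image in the batch.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if caption_ids[j] == caption_ids[i]]
        if not positives:
            continue  # anchor has no same-caption partner in this batch
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in others
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Cross-entropy against the uniform distribution over positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy batch: four 2-D embeddings standing in for encoder outputs.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Caption ids that agree with the geometry (similar images share a caption).
aligned = multi_positive_contrastive_loss(embeddings, [0, 0, 1, 1])
# Caption ids that pair dissimilar images.
shuffled = multi_positive_contrastive_loss(embeddings, [0, 1, 0, 1])
```

In this toy batch, the loss is lower when images with the same caption id already lie close together, which is the behavior such an objective would encourage in the encoder during training.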
