ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

Abstract

Large-scale pre-trained vision-language models such as CLIP have demonstrated outstanding performance in zero-shot classification, e.g., achieving 76.3% top-1 accuracy on ImageNet without seeing any examples, which offers potential benefits to many tasks that have no labeled data. However, when CLIP is applied to a downstream target domain, visual and text domain gaps and cross-modality misalignment can greatly impact model performance. To address these challenges, we propose ReCLIP, the first source-free domain adaptation method for vision-language models, which requires neither source data nor labeled target data. ReCLIP first learns a projection space to mitigate the misaligned visual-text embeddings and learn pseudo labels. It then deploys cross-modality self-training with the pseudo labels to update the visual and text encoders, refine the labels, and iteratively reduce domain gaps and misalignments. With extensive experiments, we demonstrate that ReCLIP reduces the average error rate of CLIP from 30.17% to 25.06% on 22 image classification benchmarks.
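The pseudo-labeling idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the ReCLIP procedure itself: it shows only the generic zero-shot step of assigning each image the class whose text embedding is most similar by cosine similarity, and it omits the learned projection space and the self-training loop. The function names `l2_normalize` and `pseudo_labels` and the toy 3-dimensional embeddings are assumptions made for illustration; in practice the embeddings would come from CLIP's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pseudo_labels(image_embs, text_embs):
    """Assign each image the index of the most cosine-similar class text embedding."""
    texts = [l2_normalize(t) for t in text_embs]
    labels = []
    for img in image_embs:
        img = l2_normalize(img)
        # Dot product of unit vectors equals cosine similarity.
        sims = [sum(a * b for a, b in zip(img, t)) for t in texts]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


# Hypothetical toy embeddings: two classes, three unlabeled images.
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_embs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.6, 0.5, 0.0]]
labels = pseudo_labels(image_embs, text_embs)
print(labels)  # [0, 1, 0]
```

The third image sits close to the boundary between the two classes, which is the kind of case where noisy pseudo labels arise and where a refinement step, such as the one the abstract describes, would matter.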
