ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation
August 4, 2023
Authors: Xuefeng Hu, Ke Zhang, Lu Xia, Albert Chen, Jiajia Luo, Yuyin Sun, Ken Wang, Nan Qiao, Xiao Zeng, Min Sun, Cheng-Hao Kuo, Ram Nevatia
cs.AI
Abstract
Large-scale pre-trained vision-language models such as CLIP have demonstrated
outstanding performance in zero-shot classification, e.g., achieving 76.3% top-1
accuracy on ImageNet without seeing any labeled example, which offers potential
benefits to many tasks that lack labeled data. However, when applying CLIP to a
downstream target domain, visual and text domain gaps and cross-modality
misalignment can significantly degrade model performance. To address these
challenges, we propose ReCLIP, the first source-free domain adaptation method
for vision-language models, which requires neither source data nor labeled
target data. ReCLIP first learns a projection space that mitigates misaligned
visual-text embeddings and produces pseudo labels; it then performs
cross-modality self-training with these pseudo labels to iteratively update the
visual and text encoders, refine the labels, and reduce the domain gaps and
misalignment. Through extensive experiments, we show that ReCLIP reduces the
average error rate of CLIP from 30.17% to 25.06% across 22 image classification
benchmarks.
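
The abstract describes a two-stage procedure: pseudo-labeling in a projection space that reduces visual-text misalignment, followed by cross-modality self-training with those pseudo labels. Below is a minimal illustrative sketch of these two ideas, not the authors' implementation: it assumes OpenAI's `clip` package, a hypothetical set of target class names, and a simplified projection that merely removes a shared mean direction from the embeddings.

```python
# Illustrative sketch only (not the authors' released implementation).
# It shows (1) pseudo-labeling in a projected embedding space that reduces
# visual-text misalignment, and (2) one cross-modality self-training step
# that updates the encoders with those pseudo labels.
# Assumptions: OpenAI's `clip` package, hypothetical class names, and a
# simplified projection that just removes a shared mean direction.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()                                    # keep fp32 for simplicity

class_names = ["dog", "cat", "car"]                      # hypothetical target classes
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

def remove_direction(feats, direction):
    """Project out `direction` from the features and re-normalize."""
    direction = F.normalize(direction, dim=-1)
    feats = feats - (feats @ direction)[:, None] * direction
    return F.normalize(feats, dim=-1)

@torch.no_grad()
def pseudo_labels(images):
    """Assign each image the nearest class in the projected embedding space."""
    img = F.normalize(model.encode_image(images), dim=-1)
    txt = F.normalize(model.encode_text(prompts), dim=-1)
    shared_mean = torch.cat([img, txt]).mean(dim=0)      # class-agnostic direction
    img_p = remove_direction(img, shared_mean)
    txt_p = remove_direction(txt, shared_mean)
    return (img_p @ txt_p.T).argmax(dim=-1)              # one pseudo label per image

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def self_training_step(images):
    """One simplified step: fit image-text similarities to the current pseudo labels."""
    labels = pseudo_labels(images)
    img = F.normalize(model.encode_image(images), dim=-1)
    txt = F.normalize(model.encode_text(prompts), dim=-1)
    logits = model.logit_scale.exp() * img @ txt.T
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the paper's actual method, the projection space and pseudo labels are refined iteratively as the encoders are updated; the sketch above shows only a single simplified step of that loop.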