ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation
August 4, 2023
Authors: Xuefeng Hu, Ke Zhang, Lu Xia, Albert Chen, Jiajia Luo, Yuyin Sun, Ken Wang, Nan Qiao, Xiao Zeng, Min Sun, Cheng-Hao Kuo, Ram Nevatia
cs.AI
Abstract
Large-scale pre-trained vision-language models such as CLIP have demonstrated
outstanding performance in zero-shot classification, e.g., achieving 76.3% top-1
accuracy on ImageNet without seeing any examples, which brings potential
benefits to many tasks that have no labeled data. However, when applying CLIP
to a downstream target domain, the presence of visual and text domain gaps and
cross-modality misalignment can greatly degrade model performance. To
address these challenges, we propose ReCLIP, the first source-free domain
adaptation method for vision-language models, which requires neither source
data nor labeled target data. ReCLIP first learns a projection space to mitigate
misaligned visual-text embeddings and to generate pseudo labels, and then
deploys cross-modality self-training with the pseudo labels to update the visual
and text encoders, refine the labels, and iteratively reduce domain gaps and
misalignment. Through extensive experiments, we demonstrate that ReCLIP reduces
the average error rate of CLIP from 30.17% to 25.06% across 22 image
classification benchmarks.
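
The abstract only outlines the method at a high level; the following PyTorch snippet is a minimal, hypothetical sketch of the first stage as described above: projecting image features onto a subspace tied to the class-text embeddings to mitigate cross-modality misalignment, then assigning pseudo labels for later self-training. The specific projection (the span of the class-text embeddings) and the entropy-based filter are assumptions made for illustration, not the paper's exact procedure.

```python
# Illustrative sketch only -- assumptions, not the authors' released implementation.
import torch
import torch.nn.functional as F


def project_onto_text_span(feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Project features onto the subspace spanned by the class-text embeddings.

    feats: (N, D) image features; text_emb: (C, D), one embedding per class name.
    The choice of subspace is an assumption made for illustration.
    """
    q, _ = torch.linalg.qr(text_emb.t())      # (D, C) orthonormal basis of the span
    projected = feats @ q @ q.t()             # drop components outside the span
    return F.normalize(projected, dim=-1)


@torch.no_grad()
def pseudo_label(img_feats: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.01):
    """Return (labels, keep_mask): nearest-class pseudo labels in the projected
    space, with an assumed entropy filter that drops uncertain predictions."""
    img_p = project_onto_text_span(F.normalize(img_feats, dim=-1), text_emb)
    txt_p = F.normalize(text_emb, dim=-1)
    probs = (img_p @ txt_p.t() / temperature).softmax(dim=-1)   # (N, C)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    labels = probs.argmax(dim=-1)
    keep = entropy < entropy.median()          # keep the more confident half
    return labels, keep


if __name__ == "__main__":
    # Stand-in tensors; in practice these would be CLIP image features and
    # class-name text embeddings for the unlabeled target domain.
    img_feats = torch.randn(128, 512)
    text_emb = torch.randn(10, 512)
    labels, keep = pseudo_label(img_feats, text_emb)
    print(labels[keep][:10])
```

The second stage described in the abstract would then fine-tune the visual and text encoders on the retained pseudo labels and regenerate the labels each round, iteratively shrinking the domain gap and misalignment.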