SILC: Improving Vision Language Pretraining with Self-Distillation
October 20, 2023
Authors: Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc Van Gool, Federico Tombari
cs.AI
Abstract
Image-text pretraining on web-scale image-caption datasets has become the default recipe for open-vocabulary classification and retrieval models, thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense prediction tasks and have shown the emergence of open-set abilities. However, the contrastive objective focuses only on image-text alignment and does not incentivise image feature learning for dense prediction tasks. In this work, we propose SILC, which augments contrastive pre-training with the simple addition of local-to-global correspondence learning by self-distillation. We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on several computer vision tasks, including classification, retrieval, and especially segmentation. We further show that SILC scales better than the baselines under the same training duration. Our model SILC sets a new state of the art for zero-shot classification, few-shot classification, image and text retrieval, zero-shot segmentation, and open-vocabulary segmentation.
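
The abstract only sketches the training objective at a high level. Below is a minimal illustrative sketch of how such a combined objective could look, assuming a CLIP-style symmetric contrastive loss plus a DINO-style self-distillation term between local crops (student) and a global view (EMA teacher). All function names, the model interface, and the hyperparameters (`momentum`, `lambda_distill`, the temperatures, the fixed logit scale) are assumptions for illustration, not the paper's actual implementation.

```python
# Sketch of a SILC-style combined objective. Assumptions: the image encoder
# returns (global_embedding, projection_head_logits); the teacher is an EMA
# copy of the student; hyperparameters are illustrative, not from the paper.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Update the frozen teacher as an exponential moving average of the student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1.0 - momentum)

def clip_contrastive_loss(img_emb, txt_emb, logit_scale=100.0):
    """Symmetric InfoNCE over matched image/text embeddings, as in CLIP."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale * img_emb @ txt_emb.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def self_distillation_loss(student_local_logits, teacher_global_logits,
                           t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the teacher's distribution on the global view and
    the student's distribution on a local crop (local-to-global matching)."""
    teacher_probs = F.softmax(teacher_global_logits / t_teacher, dim=-1).detach()
    student_logp = F.log_softmax(student_local_logits / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

def silc_step(student, teacher, text_encoder, batch, lambda_distill=1.0):
    # Global view feeds the contrastive loss and the teacher target;
    # local crops are seen by the student only.
    img_emb, _ = student(batch["global_view"])
    txt_emb = text_encoder(batch["caption_tokens"])
    with torch.no_grad():
        _, teacher_global_logits = teacher(batch["global_view"])

    loss = clip_contrastive_loss(img_emb, txt_emb)
    for crop in batch["local_crops"]:
        _, student_local_logits = student(crop)
        loss = loss + lambda_distill * self_distillation_loss(
            student_local_logits, teacher_global_logits)
    return loss
```

Note that, per the abstract, the teacher is not a separate pretrained network but an EMA copy of the student itself (updated via something like `ema_update` after each optimizer step), so the distillation target improves as training progresses while adding no extra labels beyond the image-caption pairs.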