SILC: Improving Vision Language Pretraining with Self-Distillation
October 20, 2023
Authors: Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc Van Gool, Federico Tombari
cs.AI
Abstract
Image-text pretraining on web-scale image-caption datasets has become the default recipe for open-vocabulary classification and retrieval models, thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense prediction tasks and have shown the emergence of open-set abilities. However, the contrastive objective focuses only on image-text alignment and does not incentivise image feature learning for dense prediction tasks. In this work, we propose SILC, which augments contrastive pre-training with the simple addition of local-to-global correspondence learning by self-distillation. We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on several computer vision tasks, including classification, retrieval, and especially segmentation. We further show that SILC scales better than the baselines under the same training duration. Our model SILC sets a new state of the art for zero-shot classification, few-shot classification, image and text retrieval, zero-shot segmentation, and open-vocabulary segmentation.
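
The abstract only sketches the training objective at a high level. Below is a minimal illustrative sketch of how such a combined objective could look, assuming a CLIP-style symmetric contrastive loss plus a DINO-style self-distillation term between local crops (student) and a global view (EMA teacher). All function names, the model interface, and the hyperparameters (`momentum`, `lambda_distill`, the temperatures, the fixed logit scale) are assumptions for illustration, not the paper's actual implementation.

```python
# Sketch of a SILC-style combined objective. Assumptions: the image encoder
# returns (global_embedding, projection_head_logits); the teacher is an EMA
# copy of the student; hyperparameters are illustrative, not from the paper.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Update the frozen teacher as an exponential moving average of the student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1.0 - momentum)

def clip_contrastive_loss(img_emb, txt_emb, logit_scale=100.0):
    """Symmetric InfoNCE over matched image/text embeddings, as in CLIP."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale * img_emb @ txt_emb.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def self_distillation_loss(student_local_logits, teacher_global_logits,
                           t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the teacher's distribution on the global view and
    the student's distribution on a local crop (local-to-global matching)."""
    teacher_probs = F.softmax(teacher_global_logits / t_teacher, dim=-1).detach()
    student_logp = F.log_softmax(student_local_logits / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

def silc_step(student, teacher, text_encoder, batch, lambda_distill=1.0):
    # Global view feeds the contrastive loss and the teacher target;
    # local crops are seen by the student only.
    img_emb, _ = student(batch["global_view"])
    txt_emb = text_encoder(batch["caption_tokens"])
    with torch.no_grad():
        _, teacher_global_logits = teacher(batch["global_view"])

    loss = clip_contrastive_loss(img_emb, txt_emb)
    for crop in batch["local_crops"]:
        _, student_local_logits = student(crop)
        loss = loss + lambda_distill * self_distillation_loss(
            student_local_logits, teacher_global_logits)
    return loss
```

Note that, per the abstract, the teacher is not a separate pretrained network but an EMA copy of the student itself (updated via something like `ema_update` after each optimizer step), so the distillation target improves as training progresses while adding no extra labels beyond the image-caption pairs.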