SILC: Improving Vision Language Pretraining with Self-Distillation
October 20, 2023
Authors: Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc Van Gool, Federico Tombari
cs.AI
Abstract
Image-text pretraining on web-scale image-caption datasets has become the
default recipe for open-vocabulary classification and retrieval models, thanks
to the success of CLIP and its variants. Several works have also used CLIP
features for dense prediction tasks and have shown the emergence of open-set
abilities. However, the contrastive objective only focuses on image-text
alignment and does not incentivise image feature learning for dense prediction
tasks. In this work, we propose SILC: the simple addition of local-to-global
correspondence learning by self-distillation as an additional objective for
contrastive pre-training. We show that distilling local image features from an
exponential moving average (EMA) teacher model significantly improves model
performance on several computer vision tasks, including classification,
retrieval, and especially segmentation. We further show that SILC scales
better than the baselines for the same training duration. Our model SILC sets
a new state of the art for zero-shot classification, few-shot classification,
image and text retrieval, zero-shot segmentation, and open-vocabulary
segmentation.
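
To make the combined objective concrete, below is a minimal PyTorch sketch of a training step that pairs a CLIP-style contrastive image-text loss with DINO-style local-to-global self-distillation from an EMA teacher, as the abstract describes. This is not the authors' implementation: the encoder interfaces, temperatures, momentum, and the loss weight `lambda_distill` are illustrative assumptions.

```python
# Minimal sketch of a SILC-style objective (assumptions, not the official code):
# the student image encoder is assumed to return (global embedding, projection
# logits); the teacher is an EMA copy of the student.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matched image-text pairs in the batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def self_distillation_loss(student_local, teacher_global,
                           t_student=0.1, t_teacher=0.04):
    """Local-to-global correspondence: the student's distribution on a local
    crop is matched to the EMA teacher's distribution on the global view."""
    teacher_probs = F.softmax(teacher_global / t_teacher, dim=-1).detach()
    student_logp = F.log_softmax(student_local / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Teacher weights track an exponential moving average of the student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def silc_step(student, teacher, text_encoder, global_view, local_crops,
              captions, lambda_distill=1.0):
    """One training step: contrastive loss on the global view plus
    self-distillation on each local crop; call ema_update() afterwards."""
    img_emb, _ = student(global_view)
    txt_emb = text_encoder(captions)
    with torch.no_grad():
        _, teacher_logits = teacher(global_view)
    loss = clip_contrastive_loss(img_emb, txt_emb)
    for crop in local_crops:
        _, student_logits = student(crop)
        loss = loss + lambda_distill * self_distillation_loss(
            student_logits, teacher_logits)
    return loss
```

Because only the student receives gradients, the teacher stays a slowly moving average of past students, which is what makes its global-view predictions a stable distillation target for the local crops.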