SILC: Improving Vision Language Pretraining with Self-Distillation
October 20, 2023
Authors: Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc Van Gool, Federico Tombari
cs.AI
Abstract
Image-text pretraining on web-scale image-caption datasets has become the
default recipe for open-vocabulary classification and retrieval models, thanks
to the success of CLIP and its variants. Several works have also used CLIP
features for dense prediction tasks and have shown the emergence of open-set
abilities. However, the contrastive objective only focuses on image-text
alignment and does not incentivise image feature learning for dense prediction
tasks. In this work, we propose SILC: the simple addition of local-to-global
correspondence learning by self-distillation as an additional objective for
contrastive pre-training. We show that distilling local image features from an
exponential moving average (EMA) teacher model significantly improves model
performance on several computer vision tasks, including classification,
retrieval, and especially segmentation. We further show that SILC scales
better than the baselines for the same training duration. Our model SILC sets
a new state of the art for zero-shot classification, few-shot classification,
image and text retrieval, zero-shot segmentation, and open-vocabulary
segmentation.
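
To make the combined objective concrete, below is a minimal PyTorch sketch of a training step that pairs a CLIP-style contrastive image-text loss with DINO-style local-to-global self-distillation from an EMA teacher, as the abstract describes. This is not the authors' implementation: the encoder interfaces, temperatures, momentum, and the loss weight `lambda_distill` are illustrative assumptions.

```python
# Minimal sketch of a SILC-style objective (assumptions, not the official code):
# the student image encoder is assumed to return (global embedding, projection
# logits); the teacher is an EMA copy of the student.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matched image-text pairs in the batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def self_distillation_loss(student_local, teacher_global,
                           t_student=0.1, t_teacher=0.04):
    """Local-to-global correspondence: the student's distribution on a local
    crop is matched to the EMA teacher's distribution on the global view."""
    teacher_probs = F.softmax(teacher_global / t_teacher, dim=-1).detach()
    student_logp = F.log_softmax(student_local / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Teacher weights track an exponential moving average of the student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def silc_step(student, teacher, text_encoder, global_view, local_crops,
              captions, lambda_distill=1.0):
    """One training step: contrastive loss on the global view plus
    self-distillation on each local crop; call ema_update() afterwards."""
    img_emb, _ = student(global_view)
    txt_emb = text_encoder(captions)
    with torch.no_grad():
        _, teacher_logits = teacher(global_view)
    loss = clip_contrastive_loss(img_emb, txt_emb)
    for crop in local_crops:
        _, student_logits = student(crop)
        loss = loss + lambda_distill * self_distillation_loss(
            student_logits, teacher_logits)
    return loss
```

Because only the student receives gradients, the teacher stays a slowly moving average of past students, which is what makes its global-view predictions a stable distillation target for the local crops.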