Improved baselines for vision-language pre-training
May 15, 2023
Authors: Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal
cs.AI
Abstract
Contrastive learning has emerged as an efficient framework to learn
multimodal representations. CLIP, a seminal work in this area, achieved
impressive results by training on paired image-text data using the contrastive
loss. Recent work claims improvements over CLIP using additional
non-contrastive losses inspired by self-supervised learning. However, it is
sometimes hard to disentangle the contribution of these additional losses from
other implementation details, e.g., data augmentation or regularization
techniques, used to train the model. To shed light on this matter, in this
paper, we first propose, implement and evaluate several baselines obtained by
combining contrastive learning with recent advances in self-supervised
learning. In particular, we use the loss functions that were proven successful
for visual self-supervised learning to align image and text modalities. We find
that these baselines outperform a basic implementation of CLIP. However, when a
stronger training recipe is employed, the advantage disappears. Indeed, we find
that a simple CLIP baseline can also be improved substantially, up to a 25%
relative improvement on downstream zero-shot tasks, by using well-known
training techniques that are popular in other subfields. Moreover, we discover
that it is enough to apply image and text augmentations to make up for most of
the improvement attained by prior works. With our improved training recipe for
CLIP, we obtain state-of-the-art performance on four standard datasets, and
consistently outperform prior work (up to +4% on the largest dataset), while
being substantially simpler.
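
The abstract centers on the contrastive loss that CLIP uses to align paired image and text embeddings. As a point of reference, below is a minimal sketch of a symmetric image-text contrastive (InfoNCE) loss of that kind; the function name, fixed temperature, and tensor shapes are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss.
# Names, shapes, and the fixed temperature are illustrative assumptions,
# not the authors' code.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of paired features.

    image_features, text_features: (batch_size, embed_dim) tensors produced
    by the image and text encoders, where row i of each tensor corresponds
    to the same image-text pair.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits between all images and all texts in the batch,
    # scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # The positive (matching) text for image i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

In this sketch, the non-diagonal entries of the batch similarity matrix act as negatives, which is what makes the loss contrastive; the recipe-level improvements the abstract describes (e.g., image and text augmentations) would be applied to the inputs before the encoders produce these features.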