Improved baselines for vision-language pre-training
May 15, 2023
Authors: Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal
cs.AI
Abstract
Contrastive learning has emerged as an efficient framework to learn
multimodal representations. CLIP, a seminal work in this area, achieved
impressive results by training on paired image-text data using the contrastive
loss. Recent work claims improvements over CLIP using additional
non-contrastive losses inspired by self-supervised learning. However, it is
sometimes hard to disentangle the contribution of these additional losses from
other implementation details, e.g., data augmentation or regularization
techniques, used to train the model. To shed light on this matter, in this
paper, we first propose, implement and evaluate several baselines obtained by
combining contrastive learning with recent advances in self-supervised
learning. In particular, we use the loss functions that were proven successful
for visual self-supervised learning to align image and text modalities. We find
that these baselines outperform a basic implementation of CLIP. However, when a
stronger training recipe is employed, the advantage disappears. Indeed, we find
that a simple CLIP baseline can also be improved substantially, up to a 25%
relative improvement on downstream zero-shot tasks, by using well-known
training techniques that are popular in other subfields. Moreover, we discover
that applying image and text augmentations is enough to recover most of
the improvement attained by prior works. With our improved training recipe for
CLIP, we obtain state-of-the-art performance on four standard datasets, and
consistently outperform prior work (up to +4% on the largest dataset), while
being substantially simpler.
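
The abstract refers to CLIP-style contrastive training on paired image-text data. As a minimal, hedged sketch (not the authors' implementation), the symmetric contrastive objective popularized by CLIP can be written as below; the function name, encoder outputs, and the `temperature` value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over paired embeddings.

    image_embeds, text_embeds: (batch, dim) tensors produced by the image
    and text encoders; the encoder architectures are not specified here.
    """
    # L2-normalize so that dot products are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_embeds @ text_embeds.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```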
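The abstract also notes that image and text augmentations recover most of the gains reported by prior work. The sketch below illustrates what such augmentations might look like; the specific transforms, parameters, and the toy word-dropping scheme are assumptions for illustration, not the paper's exact recipe.

```python
import random
from torchvision import transforms

# Illustrative image augmentation pipeline; crops, flips, and color jitter
# are common choices in self-supervised learning.
image_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

def text_augment(caption: str, drop_prob: float = 0.1) -> str:
    """Toy text augmentation: randomly drop words from the caption."""
    words = caption.split()
    kept = [w for w in words if random.random() > drop_prob]
    return " ".join(kept) if kept else caption
```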