Improved baselines for vision-language pre-training
May 15, 2023
Authors: Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal
cs.AI
Abstract
Contrastive learning has emerged as an efficient framework to learn
multimodal representations. CLIP, a seminal work in this area, achieved
impressive results by training on paired image-text data using the contrastive
loss. Recent work claims improvements over CLIP using additional
non-contrastive losses inspired by self-supervised learning. However, it is
sometimes hard to disentangle the contribution of these additional losses from
other implementation details, e.g., data augmentation or regularization
techniques, used to train the model. To shed light on this matter, in this
paper, we first propose, implement and evaluate several baselines obtained by
combining contrastive learning with recent advances in self-supervised
learning. In particular, we use the loss functions that were proven successful
for visual self-supervised learning to align image and text modalities. We find
that these baselines outperform a basic implementation of CLIP. However, when a
stronger training recipe is employed, the advantage disappears. Indeed, we find
that a simple CLIP baseline can also be improved substantially, up to a 25%
relative improvement on downstream zero-shot tasks, by using well-known
training techniques that are popular in other subfields. Moreover, we discover
that applying image and text augmentations is enough to recover most of
the improvement attained by prior works. With our improved training recipe for
CLIP, we obtain state-of-the-art performance on four standard datasets, and
consistently outperform prior work (up to +4% on the largest dataset), while
being substantially simpler.
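
The abstract refers to CLIP-style contrastive training on paired image-text data. As a minimal, hedged sketch (not the authors' implementation), the symmetric contrastive objective popularized by CLIP can be written as below; the function name, encoder outputs, and the `temperature` value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over paired embeddings.

    image_embeds, text_embeds: (batch, dim) tensors produced by the image
    and text encoders; the encoder architectures are not specified here.
    """
    # L2-normalize so that dot products are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_embeds @ text_embeds.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```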
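The abstract also notes that image and text augmentations recover most of the gains reported by prior work. The sketch below illustrates what such augmentations might look like; the specific transforms, parameters, and the toy word-dropping scheme are assumptions for illustration, not the paper's exact recipe.

```python
import random
from torchvision import transforms

# Illustrative image augmentation pipeline; crops, flips, and color jitter
# are common choices in self-supervised learning.
image_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

def text_augment(caption: str, drop_prob: float = 0.1) -> str:
    """Toy text augmentation: randomly drop words from the caption."""
    words = caption.split()
    kept = [w for w in words if random.random() > drop_prob]
    return " ".join(kept) if kept else caption
```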