Three Towers: Flexible Contrastive Learning with Pretrained Image Models
May 26, 2023
Authors: Jannik Kossen, Mark Collier, Basil Mustafa, Xiao Wang, Xiaohua Zhai, Lucas Beyer, Andreas Steiner, Jesse Berent, Rodolphe Jenatton, Efi Kokiopoulou
cs.AI
Abstract
We introduce Three Towers (3T), a flexible method to improve the contrastive
learning of vision-language models by incorporating pretrained image
classifiers. While contrastive models are usually trained from scratch, LiT
(Zhai et al., 2022) has recently shown performance gains from using pretrained
classifier embeddings. However, LiT directly replaces the image tower with the
frozen embeddings, excluding any potential benefits of contrastively training
the image tower. With 3T, we propose a more flexible strategy that allows the
image tower to benefit from both pretrained embeddings and contrastive
training. To achieve this, we introduce a third tower that contains the frozen
pretrained embeddings, and we encourage alignment between this third tower and
the main image-text towers. Empirically, 3T consistently improves over LiT and
the CLIP-style from-scratch baseline for retrieval tasks. For classification,
3T reliably improves over the from-scratch baseline, and while it underperforms
relative to LiT for JFT-pretrained models, it outperforms LiT for ImageNet-21k
and Places365 pretraining.
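The abstract describes an objective that combines the standard image-text contrastive loss with alignment terms tying both main towers to a frozen third tower of pretrained embeddings. As a rough illustration only, the sketch below combines a symmetric InfoNCE loss across the three tower pairs; the function names, the shared loss form for all pairs, and the single weight `w` are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE contrastive loss between two batches of
    embeddings a, b of shape (batch, dim); matching rows are positives."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                      # (batch, batch) similarity matrix
    labels = np.arange(len(a))                          # positives lie on the diagonal
    log_softmax = lambda x: x - np.log(np.exp(x).sum(axis=1, keepdims=True))
    loss_ab = -log_softmax(logits)[labels, labels].mean()   # a -> b direction
    loss_ba = -log_softmax(logits.T)[labels, labels].mean() # b -> a direction
    return (loss_ab + loss_ba) / 2

def three_towers_loss(img_emb, txt_emb, frozen_emb, w=1.0):
    """Hypothetical 3T-style objective: the usual image-text contrastive
    term plus alignment terms between each trainable tower and the frozen
    pretrained tower. The weight w is an assumed hyperparameter."""
    main = info_nce(img_emb, txt_emb)          # standard CLIP-style term
    align_img = info_nce(img_emb, frozen_emb)  # image tower <-> frozen tower
    align_txt = info_nce(txt_emb, frozen_emb)  # text tower  <-> frozen tower
    return main + w * (align_img + align_txt)
```

Gradients would flow only through `img_emb` and `txt_emb`; `frozen_emb` comes from the frozen pretrained classifier, so the image tower still receives a contrastive signal while being pulled toward the pretrained representation.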