Three Towers: Flexible Contrastive Learning with Pretrained Image Models
May 26, 2023
Authors: Jannik Kossen, Mark Collier, Basil Mustafa, Xiao Wang, Xiaohua Zhai, Lucas Beyer, Andreas Steiner, Jesse Berent, Rodolphe Jenatton, Efi Kokiopoulou
cs.AI
Abstract
We introduce Three Towers (3T), a flexible method to improve the contrastive
learning of vision-language models by incorporating pretrained image
classifiers. While contrastive models are usually trained from scratch, LiT
(Zhai et al., 2022) has recently shown performance gains from using pretrained
classifier embeddings. However, LiT directly replaces the image tower with the
frozen embeddings, excluding any potential benefits of contrastively training
the image tower. With 3T, we propose a more flexible strategy that allows the
image tower to benefit from both pretrained embeddings and contrastive
training. To achieve this, we introduce a third tower that contains the frozen
pretrained embeddings, and we encourage alignment between this third tower and
the main image-text towers. Empirically, 3T consistently improves over LiT and
the CLIP-style from-scratch baseline for retrieval tasks. For classification,
3T reliably improves over the from-scratch baseline, and while it underperforms
relative to LiT for JFT-pretrained models, it outperforms LiT for ImageNet-21k
and Places365 pretraining.
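The abstract describes an objective that combines the standard image-text contrastive loss with alignment terms tying both main towers to a frozen third tower of pretrained embeddings. As a rough illustration only, the sketch below combines a symmetric InfoNCE loss across the three tower pairs; the function names, the shared loss form for all pairs, and the single weight `w` are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE contrastive loss between two batches of
    embeddings a, b of shape (batch, dim); matching rows are positives."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                      # (batch, batch) similarity matrix
    labels = np.arange(len(a))                          # positives lie on the diagonal
    log_softmax = lambda x: x - np.log(np.exp(x).sum(axis=1, keepdims=True))
    loss_ab = -log_softmax(logits)[labels, labels].mean()   # a -> b direction
    loss_ba = -log_softmax(logits.T)[labels, labels].mean() # b -> a direction
    return (loss_ab + loss_ba) / 2

def three_towers_loss(img_emb, txt_emb, frozen_emb, w=1.0):
    """Hypothetical 3T-style objective: the usual image-text contrastive
    term plus alignment terms between each trainable tower and the frozen
    pretrained tower. The weight w is an assumed hyperparameter."""
    main = info_nce(img_emb, txt_emb)          # standard CLIP-style term
    align_img = info_nce(img_emb, frozen_emb)  # image tower <-> frozen tower
    align_txt = info_nce(txt_emb, frozen_emb)  # text tower  <-> frozen tower
    return main + w * (align_img + align_txt)
```

Gradients would flow only through `img_emb` and `txt_emb`; `frozen_emb` comes from the frozen pretrained classifier, so the image tower still receives a contrastive signal while being pulled toward the pretrained representation.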