Three Towers: Flexible Contrastive Learning with Pretrained Image Models

Abstract

We introduce Three Towers (3T), a flexible method to improve the contrastive learning of vision-language models by incorporating pretrained image classifiers. While contrastive models are usually trained from scratch, LiT (Zhai et al., 2022) has recently shown performance gains from using pretrained classifier embeddings. However, LiT directly replaces the image tower with the frozen embeddings, excluding any potential benefits of contrastively training the image tower. With 3T, we propose a more flexible strategy that allows the image tower to benefit from both pretrained embeddings and contrastive training. To achieve this, we introduce a third tower that contains the frozen pretrained embeddings, and we encourage alignment between this third tower and the main image-text towers. Empirically, 3T consistently improves over LiT and the CLIP-style from-scratch baseline for retrieval tasks. For classification, 3T reliably improves over the from-scratch baseline, and while it underperforms relative to LiT for JFT-pretrained models, it outperforms LiT for ImageNet-21k and Places365 pretraining.
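
To make the three-tower idea concrete, the following is a minimal sketch of what such an objective could look like. It is not the authors' implementation: the abstract only says that alignment between the third tower and the main towers is encouraged, so the choices here are assumptions made for illustration. Specifically, the sketch assumes that alignment is expressed as additional contrastive losses, that all three losses are weighted equally, that the three towers already output embeddings of the same dimension, and that the temperature is fixed at 0.1. The function names `contrastive_loss` and `three_tower_loss` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(a, b, temperature=0.1):
    """Symmetric InfoNCE-style loss between two batches of embeddings.

    a[i] and b[i] form a positive pair; all other pairings are negatives.
    """
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    n = len(a)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean of -log softmax(row)[i], with the positive on the diagonal.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))


def three_tower_loss(image_emb, text_emb, frozen_emb, temperature=0.1):
    """Hypothetical 3T-style objective (assumed form, equal weights).

    image_emb and text_emb come from the trainable main towers;
    frozen_emb comes from the frozen pretrained third tower.
    """
    main = contrastive_loss(image_emb, text_emb, temperature)
    image_align = contrastive_loss(image_emb, frozen_emb, temperature)
    text_align = contrastive_loss(text_emb, frozen_emb, temperature)
    return (main + image_align + text_align) / 3.0
```

In a real training setup the frozen embeddings would be treated as constants, so gradients would flow only into the image and text towers. This differs from the LiT setup described above, where the frozen embeddings replace the image tower outright.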
