Improved baselines for vision-language pre-training

Abstract

Contrastive learning has emerged as an efficient framework for learning multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data with a contrastive loss. Recent work claims improvements over CLIP using additional non-contrastive losses inspired by self-supervised learning. However, it is sometimes hard to disentangle the contribution of these additional losses from other implementation details used to train the model, such as data augmentation or regularization techniques. To shed light on this matter, we first propose, implement, and evaluate several baselines obtained by combining contrastive learning with recent advances in self-supervised learning. In particular, we use loss functions that have proven successful for visual self-supervised learning to align the image and text modalities. We find that these baselines outperform a basic implementation of CLIP. However, when a stronger training recipe is employed, the advantage disappears. Indeed, we find that a simple CLIP baseline can also be improved substantially, by up to 25% (relative) on downstream zero-shot tasks, by using well-known training techniques that are popular in other subfields. Moreover, we discover that applying image and text augmentations is enough to make up for most of the improvement attained by prior work. With our improved training recipe for CLIP, we obtain state-of-the-art performance on four standard datasets and consistently outperform prior work (by up to +4% on the largest dataset), while being substantially simpler.
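
For readers unfamiliar with the contrastive loss mentioned in the abstract, the following is a minimal sketch of a symmetric image-text contrastive objective of the kind CLIP is generally described as using. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed temperature of 0.07, and the pure-Python list representation of embeddings are all chosen here for readability. Row i of the image batch is assumed to be paired with row i of the text batch.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumption: temperature is a fixed constant here; in practice it is
    often a learnable parameter.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)
```

The loss is small when each image is most similar to its own caption and large when the pairing is scrambled, which is the signal that pulls matched image and text embeddings together during training.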
