Improving CLIP Training with Language Rewrites

Abstract

Contrastive Language-Image Pre-training (CLIP) is one of the most effective and scalable methods for training transferable vision models from paired image and text data. CLIP models are trained with a contrastive loss, which typically relies on data augmentation to prevent overfitting and shortcut learning. In the CLIP training paradigm, however, augmentation is applied only to the image inputs. The language inputs remain unchanged throughout training, so each image is only ever paired with the same text. In this paper, we introduce Language augmented CLIP (LaCLIP), a simple yet highly effective approach that enhances CLIP training through language rewrites. Leveraging the in-context learning capability of large language models, we rewrite the text description associated with each image. The rewritten texts vary in sentence structure and vocabulary while preserving the key concepts and meaning of the original. During training, LaCLIP randomly selects either the original text or one of its rewritten versions as the text augmentation for each image. Extensive experiments on the CC3M, CC12M, RedCaps, and LAION-400M datasets show that CLIP pre-training with language rewrites significantly improves transfer performance without adding computation or memory overhead during training. In ImageNet zero-shot accuracy specifically, LaCLIP outperforms CLIP by 8.2% on CC12M and by 2.4% on LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP.
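The text augmentation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_caption` and `augment_batch`, the tuple layout of a batch, and the choice of uniform sampling over the original and its rewrites are all assumptions made for illustration. The abstract only states that the selection is random.

```python
import random
from typing import List, Sequence, Tuple


def sample_caption(original: str, rewrites: Sequence[str], rng: random.Random) -> str:
    """Pick one caption from the original text and its rewritten versions.

    Assumption: every candidate is equally likely (uniform sampling).
    """
    candidates = [original, *rewrites]
    return rng.choice(candidates)


def augment_batch(
    batch: Sequence[Tuple[str, str, Sequence[str]]],
    rng: random.Random,
) -> List[Tuple[str, str]]:
    """Map (image_id, original, rewrites) triples to (image_id, caption) pairs.

    The hypothetical image_id stands in for the image itself. Image-side
    augmentation would be applied separately and is not shown here.
    """
    return [
        (image_id, sample_caption(original, rewrites, rng))
        for image_id, original, rewrites in batch
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    batch = [
        ("img0", "a dog on grass", ["a puppy sitting on a lawn", "grass with a dog on it"]),
        ("img1", "red car", ["a car painted red"]),
    ]
    # Each call may pair the same image with a different caption.
    for epoch in range(3):
        print(epoch, augment_batch(batch, rng))
```

Because the rewrites are generated ahead of time and sampling a string is cheap, a step like this is consistent with the claim that language augmentation adds no computation or memory overhead during training.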
