Improving CLIP Training with Language Rewrites
May 31, 2023
Authors: Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, Yonglong Tian
cs.AI
Abstract
Contrastive Language-Image Pre-training (CLIP) stands as one of the most
effective and scalable methods for training transferable vision models using
paired image and text data. CLIP models are trained using contrastive loss,
which typically relies on data augmentations to prevent overfitting and
shortcuts. However, in the CLIP training paradigm, data augmentations are
exclusively applied to image inputs, while language inputs remain unchanged
throughout the entire training process, limiting the model's exposure to diverse
texts for the same image. In this paper, we introduce Language augmented CLIP
(LaCLIP), a simple yet highly effective approach to enhance CLIP training
through language rewrites. Leveraging the in-context learning capability of
large language models, we rewrite the text descriptions associated with each
image. These rewritten texts exhibit diversity in sentence structure and
vocabulary while preserving the original key concepts and meanings. During
training, LaCLIP randomly selects either the original texts or the rewritten
versions as text augmentations for each image. Extensive experiments on CC3M,
CC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with
language rewrites significantly improves the transfer performance without
computation or memory overhead during training. Specifically, for ImageNet
zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on
LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP.
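To make the augmentation step concrete, below is a minimal sketch of how the training-time text selection described in the abstract could look: each image keeps its original caption plus a pool of LLM-generated rewrites, and at every iteration one of them is sampled uniformly at random. The data structures and function names here are illustrative assumptions, not the authors' actual implementation (see the linked repository for that).

```python
import random

# Hypothetical caption pool: for each image, the original caption plus
# several LLM-rewritten variants (captions below are made-up examples).
caption_pools = {
    "img_001.jpg": [
        "A dog runs across a grassy field.",              # original caption
        "A dog is sprinting over a green meadow.",        # LLM rewrite 1
        "Across the grass, a dog dashes at full speed.",  # LLM rewrite 2
    ],
}


def sample_text_augmentation(image_id: str) -> str:
    """Uniformly sample the original caption or one of its rewrites.

    This mirrors the augmentation step described in the abstract: at each
    training iteration, LaCLIP randomly selects either the original text
    or a rewritten version as the text paired with the image.
    """
    return random.choice(caption_pools[image_id])


# During training, each batch would pair images with freshly sampled captions.
print(sample_text_augmentation("img_001.jpg"))
```

Because the rewrites are generated once, offline, before pre-training, this selection is a constant-time lookup per sample, which is consistent with the paper's claim of no extra computation or memory overhead during training.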