Improving CLIP Training with Language Rewrites

May 31, 2023
Authors: Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, Yonglong Tian
cs.AI

Abstract

Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective and scalable methods for training transferable vision models using paired image and text data. CLIP models are trained using contrastive loss, which typically relies on data augmentations to prevent overfitting and shortcuts. However, in the CLIP training paradigm, data augmentations are exclusively applied to image inputs, while language inputs remain unchanged throughout the entire training process, limiting the exposure of diverse texts to the same image. In this paper, we introduce Language augmented CLIP (LaCLIP), a simple yet highly effective approach to enhance CLIP training through language rewrites. Leveraging the in-context learning capability of large language models, we rewrite the text descriptions associated with each image. These rewritten texts exhibit diversity in sentence structure and vocabulary while preserving the original key concepts and meanings. During training, LaCLIP randomly selects either the original texts or the rewritten versions as text augmentations for each image. Extensive experiments on CC3M, CC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with language rewrites significantly improves the transfer performance without computation or memory overhead during training. Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP.
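
The training-time mechanism the abstract describes, uniformly sampling between an image's original caption and its pre-computed LLM rewrites, is simple enough to sketch. The PyTorch `Dataset` below is a minimal illustration under assumed interfaces (the class name, record layout, `image_transform`, and `tokenizer` are hypothetical, not the authors' API); the actual implementation is in the linked repository.

```python
import random

from PIL import Image
from torch.utils.data import Dataset


class LanguageAugmentedPairs(Dataset):
    """Sketch of LaCLIP-style text augmentation (hypothetical class).

    Each record pairs an image with its original caption and a list of
    LLM-generated rewrites prepared offline. Because the rewrites exist
    before training starts, sampling among them adds no training-time
    compute or memory beyond storing the extra strings.
    """

    def __init__(self, records, image_transform, tokenizer):
        # records: list of (image_path, original_caption, [rewrite, ...])
        self.records = records
        self.image_transform = image_transform  # standard CLIP image augmentations
        self.tokenizer = tokenizer              # CLIP text tokenizer

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        image_path, original, rewrites = self.records[idx]
        # Core LaCLIP step: for each image, randomly select either the
        # original caption or one of its rewritten versions as the text input.
        caption = random.choice([original] + rewrites)
        image = self.image_transform(Image.open(image_path).convert("RGB"))
        return image, self.tokenizer(caption)
```

Since the contrastive loss itself is unchanged, a dataset like this drops into a standard CLIP training loop; only the text seen per image varies across epochs.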