Improving CLIP Training with Language Rewrites

Abstract

Contrastive Language-Image Pre-training (CLIP) stands as one of the most effective and scalable methods for training transferable vision models using paired image and text data. CLIP models are trained with a contrastive loss, which typically relies on data augmentations to prevent overfitting and shortcuts. However, in the CLIP training paradigm, data augmentations are applied exclusively to image inputs, while language inputs remain unchanged throughout the entire training process, so each image is rarely paired with diverse texts. In this paper, we introduce Language augmented CLIP (LaCLIP), a simple yet highly effective approach to enhance CLIP training through language rewrites. Leveraging the in-context learning capability of large language models, we rewrite the text descriptions associated with each image. These rewritten texts exhibit diversity in sentence structure and vocabulary while preserving the original key concepts and meanings. During training, LaCLIP randomly selects either the original text or one of the rewritten versions as the text augmentation for each image. Extensive experiments on the CC3M, CC12M, RedCaps, and LAION-400M datasets show that CLIP pre-training with language rewrites significantly improves transfer performance without additional computation or memory overhead during training. Specifically, for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP.
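
The text augmentation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_caption`, the example captions, and the choice of uniform sampling over the original text and its rewrites are all assumptions made for illustration. The abstract only states that the selection is random.

```python
import random
from typing import Sequence


def sample_caption(original: str, rewrites: Sequence[str], rng: random.Random) -> str:
    """Return the original caption or one of its rewrites.

    Assumption: the choice is uniform over all candidates. The abstract
    says only that LaCLIP selects "randomly", so the exact distribution
    used by the authors may differ.
    """
    candidates = [original, *rewrites]
    return rng.choice(candidates)


# Hypothetical example data: one original caption and two rewrites that
# would have been generated offline by a large language model.
original = "a dog running on the beach"
rewrites = [
    "a dog sprints along the shore",
    "on the sand, a dog is running",
]

rng = random.Random(0)  # seeded only to make the sketch reproducible
for step in range(3):
    # In training, this draw would happen each time the image is loaded,
    # so the same image is paired with different texts across epochs.
    caption = sample_caption(original, rewrites, rng)
    print(step, caption)
```

Because the rewrites are produced before training and the per-sample choice is a single random draw, a scheme like this is consistent with the abstract's statement that the augmentation adds no computation or memory overhead during training.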
